Data Layer

Introduction

The data layer includes the following parts.

Data Preparation
Data API
Data Loader
Data Handler
Dataset
Cache
Data and Cache File Structure

Here is a typical data workflow in Qlib.

Download data and convert it into Qlib format (with the filename suffix .bin).
[Data Handler] Create basic features with Qlib's expression engine (e.g. "Ref($close, 60) / $close", the return of the last 60 trading days). The supported operators are listed in the expression engine's operator reference.
[Data Handler] If users require more complicated data processing (e.g. normalization or filling NA values), they can define their own data processors and add them to the data handler. The predefined data processors are listed in the processor reference.
Finally, the data handler returns a dataset, which can be used in the model training process.
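
To make the example expression in the workflow concrete, the following plain-Python sketch mimics what an expression of the form "Ref($close, N) / $close" evaluates to on a single price series. It is not Qlib's expression engine: `ref` and `ref_ratio` are hypothetical helper names, the prices are made up, and the window is shortened to 2 so the result is easy to follow.

```python
from typing import List, Optional


def ref(series: List[float], n: int) -> List[Optional[float]]:
    """Shift a series back by n steps: position t holds the value from t - n."""
    if n < 0:
        raise ValueError("this sketch only handles look-back windows (n >= 0)")
    head = [None] * min(n, len(series))
    tail = series[: max(len(series) - n, 0)]
    return head + tail


def ref_ratio(close: List[float], n: int) -> List[Optional[float]]:
    """Element-wise ref(close, n) / close; None where there is no history yet."""
    shifted = ref(close, n)
    return [None if past is None else past / now for past, now in zip(shifted, close)]


close = [10.0, 11.0, 12.0, 12.0]  # made-up closing prices
shifted = ref(close, 2)
ratios = ref_ratio(close, 2)
print(shifted)
print(ratios)
```

The first `n` positions have no history, so the sketch marks them with `None` rather than guessing a value.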

Data preparation

Automatic update of daily frequency data

Converting CSV Format into Qlib Format

# for daily data
python scripts/get_data.py download_data --file_name csv_data_cn.zip --target_dir ~/.qlib/csv_data/cn_data
# for 1 min data
python scripts/data_collector/yahoo/collector.py download_data --source_dir ~/.qlib/stock_data/source/cn_1min --region CN --start 2021-05-20 --end 2021-05-23 --delay 0.1 --interval 1min --limit_nums 10

Data API

Retrieval
Feature
- ExpressionOps

Filter

NameDFilter
ExpressionDFilter
- basic features filter: rule_expression = '$close/$open>5'
- cross-sectional features filter: rule_expression = '$rank($close)<10'
- time-sequence features filter: rule_expression = '$Ref($close, 3)>100'
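
As a rough illustration of what a basic features filter such as '$close/$open>5' selects, the sketch below applies the same condition to toy rows in plain Python. It does not use Qlib's filter classes; `apply_rule`, the instrument names, and the prices are hypothetical, and the real filter's handling of time spans may differ.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy data: instrument -> list of (date, open, close) rows.
Row = Tuple[str, float, float]
prices: Dict[str, List[Row]] = {
    "AAA": [("2010-01-04", 1.0, 6.0), ("2010-01-05", 2.0, 4.0)],
    "BBB": [("2010-01-04", 3.0, 3.3), ("2010-01-05", 1.0, 5.5)],
}


def apply_rule(
    data: Dict[str, List[Row]], rule: Callable[[float, float], bool]
) -> Dict[str, List[str]]:
    """Return, per instrument, the dates on which the rule holds."""
    selected: Dict[str, List[str]] = {}
    for instrument, rows in data.items():
        dates = [date for date, open_, close in rows if rule(open_, close)]
        if dates:  # instruments that never satisfy the rule are left out
            selected[instrument] = dates
    return selected


# Stand-in for rule_expression = '$close/$open>5'
result = apply_rule(prices, lambda open_, close: close / open_ > 5)
print(result)
```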

filter: &filter
    filter_type: ExpressionDFilter
    rule_expression: "Ref($close, -2) / Ref($close, -1) > 1"
    filter_start_time: 2010-01-01
    filter_end_time: 2010-01-07
    keep: False

data_handler_config: &data_handler_config
    start_time: 2010-01-01
    end_time: 2021-01-22
    fit_start_time: 2010-01-01
    fit_end_time: 2015-12-31
    instruments: *market
    filter_pipe: [*filter]
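
For readers less familiar with YAML anchors and aliases, the configuration above can be read as the following Python dictionaries. This is only a sketch of the structure: the value of `market` is a placeholder because the `*market` alias is defined elsewhere, and the dates are kept as strings here although a YAML parser may turn unquoted dates into date objects.

```python
# Placeholder: the *market alias refers to an anchor defined elsewhere in the file.
market = "<market>"

# Corresponds to the block anchored as &filter.
filter_config = {
    "filter_type": "ExpressionDFilter",
    "rule_expression": "Ref($close, -2) / Ref($close, -1) > 1",
    "filter_start_time": "2010-01-01",
    "filter_end_time": "2010-01-07",
    "keep": False,
}

# Corresponds to the block anchored as &data_handler_config.
data_handler_config = {
    "start_time": "2010-01-01",
    "end_time": "2021-01-22",
    "fit_start_time": "2010-01-01",
    "fit_end_time": "2015-12-31",
    "instruments": market,
    # [*filter] reuses the anchored filter block, so the same object is referenced.
    "filter_pipe": [filter_config],
}

print(data_handler_config["filter_pipe"][0]["rule_expression"])
```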

Data Loader

Data Loader in Qlib is designed to load raw data from the original data source. The Data Handler module then uses the loaded data.

QlibDataLoader

from qlib.data.dataset.loader import QlibDataLoader  # assumes qlib.init() has already been called
qdl = QlibDataLoader(config=["$close / Ref($close, 5)", "$close", "Ref($close, 5)"])
qdl.load(instruments=["sh600519"], start_time="20190101", end_time="20191231")

To learn more about Data Loader, please refer to the Data Loader API.
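
The loader's role, fetching raw values for a set of instruments within a time window, can be sketched with a small in-memory stand-in. This is not Qlib's loader: `ToyDataLoader` and `InMemoryLoader` are hypothetical classes, the values are made up, and dates are plain "YYYYMMDD" strings so that string comparison matches date order.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class ToyDataLoader(ABC):
    """Hypothetical loader interface: return raw values for a time window."""

    @abstractmethod
    def load(
        self,
        instruments: List[str],
        start_time: Optional[str] = None,
        end_time: Optional[str] = None,
    ) -> Dict[str, Dict[str, float]]:
        ...


class InMemoryLoader(ToyDataLoader):
    """Serves data from a dict of instrument -> {date: value}."""

    def __init__(self, source: Dict[str, Dict[str, float]]):
        self._source = source

    def load(self, instruments, start_time=None, end_time=None):
        out = {}
        for inst in instruments:
            rows = self._source.get(inst, {})
            out[inst] = {
                date: value
                for date, value in sorted(rows.items())
                if (start_time is None or date >= start_time)
                and (end_time is None or date <= end_time)
            }
        return out


# Made-up values, used only to show the windowing behaviour.
loader = InMemoryLoader({"sh600519": {"20181228": 1.0, "20190102": 2.0, "20200102": 3.0}})
loaded = loader.load(["sh600519"], start_time="20190101", end_time="20191231")
print(loaded)
```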

Data Handler

The Data Handler module in Qlib is designed to handle the common data processing methods that most models use.
Users can run Data Handler in an automatic workflow with qrun; refer to Workflow: Workflow Management for more details.

DataHandlerLP (DataHandler with (L)earnable (P)rocessor)

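from qlib.data.dataset.handler import DataHandlerLP
from qlib.data.dataset.processor import ZScoreNorm, Fillna

# qdl is the QlibDataLoader instance created in the Data Loader section.
dh = DataHandlerLP(
    instruments=["sh600519"],
    start_time="20170101",
    end_time="20191231",
    # Processors applied when preparing data for inference.
    infer_processors=[
        ZScoreNorm(fit_start_time="20170101", fit_end_time="20181231"),
        Fillna(),
    ],
    data_loader=qdl,
)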
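
The "learnable" part of a processor can be illustrated with a minimal sketch: statistics are fitted on one time window and then applied to the whole series, which mirrors the `fit_start_time` and `fit_end_time` arguments in the example above. `ToyZScore` is a hypothetical class, not Qlib's `ZScoreNorm`; the use of the population standard deviation and the fallback for a constant window are assumptions of this sketch.

```python
from statistics import mean, pstdev
from typing import Dict, Optional


class ToyZScore:
    """Hypothetical learnable processor: fit on one window, apply everywhere."""

    def __init__(self, fit_start_time: str, fit_end_time: str):
        self.fit_start_time = fit_start_time
        self.fit_end_time = fit_end_time
        self.mean_: Optional[float] = None
        self.std_: Optional[float] = None

    def fit(self, series: Dict[str, float]) -> None:
        # Only values inside the fit window contribute to the statistics.
        window = [
            value
            for date, value in series.items()
            if self.fit_start_time <= date <= self.fit_end_time
        ]
        self.mean_ = mean(window)
        self.std_ = pstdev(window) or 1.0  # avoid dividing by zero on a constant window

    def __call__(self, series: Dict[str, float]) -> Dict[str, float]:
        if self.mean_ is None or self.std_ is None:
            raise RuntimeError("call fit() before applying the processor")
        return {date: (value - self.mean_) / self.std_ for date, value in series.items()}


series = {"20170103": 1.0, "20180102": 3.0, "20190102": 5.0}  # made-up values
proc = ToyZScore(fit_start_time="20170101", fit_end_time="20181231")
proc.fit(series)
normalized = proc(series)
print(normalized)
```

Because the 2019 value lies outside the fit window, it is scaled with statistics learned from earlier data only, which is the point of separating fitting from application.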

Processor

DropnaProcessor: a processor that drops N/A features.

DropnaLabel: a processor that drops N/A labels.

TanhProcess: a processor that uses tanh to process noise data.

ProcessInf: a processor that handles infinity values by replacing them with the mean of the column.

Fillna: a processor that handles N/A values by filling them with 0 or another given number.

MinMaxNorm: a processor that applies min-max normalization.

ZScoreNorm: a processor that applies z-score normalization.

RobustZScoreNorm: a processor that applies robust z-score normalization.

CSZScoreNorm: a processor that applies cross-sectional z-score normalization.

CSRankNorm: a processor that applies cross-sectional rank normalization.

CSZFillna: a processor that fills N/A values in a cross-sectional way by the mean of the column.

Users can also define their own processors by inheriting from the base class Processor. See the Processor API for more details.
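
The difference between the cross-sectional processors in the list above and their time-series counterparts is the axis of normalization: a cross-sectional z-score is computed across instruments separately for each date. The sketch below shows that idea in plain Python; `cs_zscore` is a hypothetical helper, not Qlib's `CSZScoreNorm`, and the values are made up.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (date, instrument)


def cs_zscore(values: Dict[Key, float]) -> Dict[Key, float]:
    """Normalize each value against the other instruments on the same date."""
    by_date: Dict[str, List[float]] = defaultdict(list)
    for (date, _instrument), value in values.items():
        by_date[date].append(value)
    # Per-date mean and standard deviation; fall back to 1.0 if all values are equal.
    stats = {date: (mean(vs), pstdev(vs) or 1.0) for date, vs in by_date.items()}
    return {
        (date, inst): (value - stats[date][0]) / stats[date][1]
        for (date, inst), value in values.items()
    }


values = {
    ("20190102", "A"): 1.0,
    ("20190102", "B"): 3.0,
    ("20190103", "A"): 10.0,
    ("20190103", "B"): 30.0,
}
out = cs_zscore(values)
print(out)
```

Although the raw values differ by a factor of ten between the two dates, the normalized values are identical, since each date is scaled on its own.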

Dataset

The Dataset module in Qlib aims to prepare data for model training and inference.
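
One way to picture this role is a dataset that slices processed rows into named date segments for training and evaluation. The sketch below is a plain-Python illustration under that assumption; `ToyDataset`, the segment names, and the `prepare` method are hypothetical choices of this sketch and are not taken from the text above.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, float]  # (date, feature value)


class ToyDataset:
    """Hypothetical dataset: slices processed rows into named date segments."""

    def __init__(self, rows: List[Row], segments: Dict[str, Tuple[str, str]]):
        self._rows = sorted(rows)
        self._segments = segments

    def prepare(self, segment: str) -> List[Row]:
        """Return the rows whose date falls inside the named segment (inclusive)."""
        start, end = self._segments[segment]
        return [(date, value) for date, value in self._rows if start <= date <= end]


rows = [("20170103", 0.1), ("20180102", 0.2), ("20190102", 0.3)]  # made-up values
dataset = ToyDataset(
    rows,
    segments={
        "train": ("20170101", "20171231"),
        "valid": ("20180101", "20181231"),
        "test": ("20190101", "20191231"),
    },
)
train_rows = dataset.prepare("train")
test_rows = dataset.prepare("test")
print(train_rows, test_rows)
```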

Cache

The Cache is an optional module that speeds up data access by saving frequently used data as cache files. Qlib provides a Memcache class to cache the most frequently used data in memory, an inheritable ExpressionCache class, and an inheritable DatasetCache class.

Global Memory Cache
ExpressionCache
DatasetCache
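
The idea behind an in-memory cache can be sketched as a bounded mapping that computes a value on the first request and reuses it afterwards. `ToyMemCache` below is a hypothetical class, not Qlib's implementation; in particular, the least-recently-used eviction policy and the `max_items` limit are assumptions made for the sketch.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class ToyMemCache:
    """Hypothetical bounded in-memory cache with least-recently-used eviction."""

    def __init__(self, max_items: int):
        if max_items < 1:
            raise ValueError("max_items must be at least 1")
        self._max_items = max_items
        self._store: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.misses = 0  # number of times a value had to be computed

    def get(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        self.misses += 1
        value = compute()
        self._store[key] = value
        if len(self._store) > self._max_items:
            self._store.popitem(last=False)  # drop the least recently used entry
        return value


cache = ToyMemCache(max_items=2)
cache.get("a", lambda: 1)  # miss
cache.get("b", lambda: 2)  # miss
cache.get("a", lambda: 1)  # hit, "a" becomes most recently used
cache.get("c", lambda: 3)  # miss, evicts "b"
value_b = cache.get("b", lambda: 2)  # miss again because "b" was evicted
print(cache.misses)
```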

Data and Cache File Structure

See the paper for more details.
