
Data Layer


Introduction

The data layer includes the following parts.

  • Data Preparation
  • Data API
  • Data Loader
  • Data Handler
  • Dataset
  • Cache
  • Data and Cache File Structure

Here is a typical data workflow in Qlib.

  • Download data and convert it into Qlib format (files with the .bin suffix).
  • [Data Handler] Create basic features with Qlib’s expression engine (e.g. “Ref($close, 60) / $close”, the close price 60 trading days ago relative to the current close). Supported operators in the expression engine can be found here.
  • [Data Handler] If users require more complicated data processing (e.g. normalization, filling NA values), they can define their own data processors and add them to the data handler. Predefined data processors can be found here.
  • Finally, the data handler returns a dataset, which can be used in the model training process.
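
The feature expression in the second step can be read as element-wise arithmetic over shifted series. The sketch below mimics that reading in plain Python. It is not Qlib's expression engine: `ref` and `divide` are hypothetical helpers, and it assumes that Ref(x, N) with N > 0 means "the value N trading days earlier", with NaN where no history exists. A 2-day look-back is used to keep the sample small.

```python
import math

def ref(series, n):
    """Hypothetical stand-in for Ref(series, n) with n >= 0:
    the value n steps earlier, NaN where there is no history."""
    return ([math.nan] * n + list(series))[:len(series)]

def divide(a, b):
    """Element-wise a / b; NaN propagates through the division."""
    return [x / y for x, y in zip(a, b)]

close = [10.0, 11.0, 12.0, 15.0]

# Analogue of "Ref($close, 2) / $close" over four trading days.
feature = divide(ref(close, 2), close)
```

The first two entries of `feature` are NaN because there is no close price two days before them; a real pipeline would typically handle such gaps with a processor.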

Data Preparation

  • Automatic update of daily frequency data
  • Converting CSV Format into Qlib Format
    # for daily data
    python scripts/get_data.py download_data --file_name csv_data_cn.zip --target_dir ~/.qlib/csv_data/cn_data
    # for 1 min data
    python scripts/data_collector/yahoo/collector.py download_data --source_dir ~/.qlib/stock_data/source/cn_1min --region CN --start 2021-05-20 --end 2021-05-23 --delay 0.1 --interval 1min --limit_nums 10
    

Data API

  • Retrieval
  • Feature
  • Filter
    • NameDFilter
    • ExpressionDFilter
      • basic features filter: rule_expression = ‘$close/$open>5’
      • cross-sectional features filter: rule_expression = ‘$rank($close)<10’
      • time-sequence features filter: rule_expression = ‘$Ref($close, 3)>100’
    filter: &filter
        filter_type: ExpressionDFilter
        rule_expression: "Ref($close, -2) / Ref($close, -1) > 1"
        filter_start_time: 2010-01-01
        filter_end_time: 2010-01-07
        keep: False
    
    data_handler_config: &data_handler_config
        start_time: 2010-01-01
        end_time: 2021-01-22
        fit_start_time: 2010-01-01
        fit_end_time: 2015-12-31
        instruments: *market
        filter_pipe: [*filter]
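
To make the rule in the filter configuration concrete, the sketch below evaluates an analogue of "Ref($close, -2) / Ref($close, -1) > 1" for a single instrument in plain Python. It assumes that a negative offset in Ref looks forward in time, so the rule compares the close two days ahead with the close one day ahead. `ref_forward` and `rule_mask` are hypothetical helpers; how Qlib applies the resulting mask to the instrument pool (including the `keep` flag) is not modeled here.

```python
import math

def ref_forward(series, k):
    """Value k steps later (analogue of Ref(series, -k)); NaN past the end."""
    return (list(series) + [math.nan] * k)[k:]

def rule_mask(close):
    """True on days t where close[t + 2] / close[t + 1] > 1."""
    two_ahead = ref_forward(close, 2)
    one_ahead = ref_forward(close, 1)
    # Any comparison with NaN is False, so days without enough
    # future data do not satisfy the rule.
    return [a / b > 1 for a, b in zip(two_ahead, one_ahead)]

close = [10.0, 9.0, 9.5, 9.2, 9.8]
mask = rule_mask(close)
```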
    

Data Loader

  • Data Loader in Qlib is designed to load raw data from the original data source. It will be loaded and used in the Data Handler module.

  • QlibDataLoader

    from qlib.data.dataset.loader import QlibDataLoader
    qdl = QlibDataLoader(config=["$close / Ref($close, 5)", "$close", "Ref($close, 5)"])
    qdl.load(instruments=["sh600519"], start_time="20190101", end_time="20191231")
    
  • To learn more about Data Loader, please refer to the Data Loader API.

Data Handler

  • The Data Handler module in Qlib is designed to handle the common data processing methods used by most models.
  • Users can use Data Handler in an automatic workflow via qrun; refer to Workflow: Workflow Management for more details.
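  • DataHandlerLP (DataHandler with (L)earnable (P)rocessor)
    from qlib.data.dataset.handler import DataHandlerLP
    from qlib.data.dataset.processor import ZScoreNorm, Fillna

    dh = DataHandlerLP(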
        instruments=["sh600519"],
        start_time="20170101",
        end_time="20191231",
        infer_processors=[
            ZScoreNorm(fit_start_time="20170101", fit_end_time="20181231"),
            Fillna(),
        ],
        data_loader=qdl,
    )
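
The "learnable" part of a processor can be illustrated without Qlib: statistics are fitted on one time window and then applied to data from any period. The sketch below is a plain-Python illustration under assumptions, not the library's ZScoreNorm. `ZScoreSketch` is a hypothetical class, the dates are compared as strings, and using the population standard deviation is a choice of this sketch.

```python
import statistics

class ZScoreSketch:
    """Hypothetical learnable processor: fit on one window, apply anywhere."""

    def __init__(self, fit_start, fit_end):
        self.fit_start, self.fit_end = fit_start, fit_end
        self.mean = None
        self.std = None

    def fit(self, rows):
        # rows: list of (date_string, value); "YYYYMMDD" strings
        # compare in chronological order.
        window = [v for d, v in rows if self.fit_start <= d <= self.fit_end]
        self.mean = statistics.fmean(window)
        self.std = statistics.pstdev(window)

    def __call__(self, rows):
        # Apply the fitted statistics to every row, inside or outside the window.
        return [(d, (v - self.mean) / self.std) for d, v in rows]

rows = [("20170103", 1.0), ("20180102", 3.0), ("20190102", 5.0)]
norm = ZScoreSketch(fit_start="20170101", fit_end="20181231")
norm.fit(rows)
normalized = norm(rows)
```

Fitting only on the 2017 to 2018 window means the 2019 row is scaled with statistics that never saw it, which is the reason for separating the fit period from the full data range.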
    

Processor

DropnaProcessor: a processor that drops N/A features.

DropnaLabel: a processor that drops N/A labels.

TanhProcess: a processor that uses tanh to process noise data.

ProcessInf: a processor that handles infinity values by replacing them with the mean of the column.

Fillna: a processor that handles N/A values by filling them with 0 or another given number.

MinMaxNorm: a processor that applies min-max normalization.

ZScoreNorm: a processor that applies z-score normalization.

RobustZScoreNorm: a processor that applies robust z-score normalization.

CSZScoreNorm: a processor that applies cross-sectional z-score normalization.

CSRankNorm: a processor that applies cross-sectional rank normalization.

CSZFillna: a processor that fills N/A values in a cross-sectional way by the mean of the column.

Users can also define their own processors by inheriting the base class Processor. See the Processor API for more details.
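
The cross-sectional processors in the list above operate across instruments within the same day rather than along time. As a minimal sketch of that idea, the code below fills missing values with the same-day mean over the other instruments, in the spirit of CSZFillna. It is plain Python and not Qlib's implementation; `cs_fill_mean`, the nested-dict layout, and the instrument codes are assumptions made for illustration.

```python
import math

def cs_fill_mean(day_values):
    """Fill NaN entries with the mean of the non-NaN values for the same day."""
    present = [v for v in day_values.values() if not math.isnan(v)]
    if not present:
        return dict(day_values)  # nothing to average; leave the day unchanged
    mean = sum(present) / len(present)
    return {k: (mean if math.isnan(v) else v) for k, v in day_values.items()}

# One feature, grouped by day, then keyed by a hypothetical instrument code.
feature = {
    "2019-01-02": {"A": 1.0, "B": math.nan, "C": 3.0},
    "2019-01-03": {"A": math.nan, "B": 4.0, "C": 6.0},
}
filled = {day: cs_fill_mean(values) for day, values in feature.items()}
```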

Dataset

The Dataset module in Qlib aims to prepare data for model training and inferencing.

Cache

The Cache is an optional module that helps accelerate providing data by saving some frequently-used data as a cache file. Qlib provides a Memcache class to cache the most frequently used data in memory, an inheritable ExpressionCache class, and an inheritable DatasetCache class.

  • Global Memory Cache
  • ExpressionCache
  • DatasetCache
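
The idea behind an in-memory cache for frequently used data can be sketched as a small size-limited store that keeps recently used entries and evicts the least recently used one. This is an illustration only: `MemoryCacheSketch`, its eviction policy, and the `load` function are assumptions of this sketch and do not describe how Qlib's cache classes are implemented.

```python
from collections import OrderedDict

class MemoryCacheSketch:
    """Hypothetical size-limited cache with least-recently-used eviction."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._store = OrderedDict()
        self.misses = 0

    def get(self, key, compute):
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        self.misses += 1
        value = compute(key)  # only computed on a cache miss
        self._store[key] = value
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)  # drop the least recently used entry
        return value

def load(expression):
    """Hypothetical stand-in for an expensive data load."""
    return f"data for {expression}"

cache = MemoryCacheSketch(max_items=2)
cache.get("$close", load)
cache.get("$open", load)
cache.get("$close", load)  # served from memory
cache.get("$high", load)   # exceeds the limit, so "$open" is evicted
```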

Data and Cache File Structure

See the paper for more details.