Data Layer
Introduction
The data layer includes the following parts.
- Data Preparation
- Data API
- Data Loader
- Data Handler
- Dataset
- Cache
- Data and Cache File Structure
Here is a typical data workflow in Qlib.
- Download data and convert it into Qlib format (with filename suffix .bin).
- [Data Handler] Create some basic features based on Qlib's expression engine (e.g. "Ref($close, 60) / $close", the return of the last 60 trading days). The expression engine documentation lists the supported operators.
- [Data Handler] If users require more complicated data processing (e.g. normalization, filling NA values), they can define their own data processors and add them to the data handler. Predefined data processors are listed in the Processor section.
- Finally, the data handler returns a dataset, which can be used in the model training process.
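To make the example expression in the workflow concrete, here is a minimal plain-Python sketch of what "Ref($close, 60) / $close" computes, using a lag of 2 so the data stays small. It does not use Qlib: `ref` and `ref_ratio` are hypothetical helper names, and the sketch assumes that Ref(x, n) with positive n means the value n trading days earlier.

```python
from typing import List, Optional


def ref(series: List[float], n: int) -> List[Optional[float]]:
    """Shift a series so position i holds the value from n steps earlier.

    Positions with no value n steps away are filled with None.
    """
    size = len(series)
    return [series[i - n] if 0 <= i - n < size else None for i in range(size)]


def ref_ratio(close: List[float], n: int) -> List[Optional[float]]:
    """Element-wise Ref(close, n) / close, None where the lagged value is missing."""
    lagged = ref(close, n)
    return [None if past is None else past / now for past, now in zip(lagged, close)]


close = [10.0, 11.0, 12.0, 15.0]
ratios = ref_ratio(close, 2)  # the first two days have no value two days back
```

The first n entries are undefined because no earlier price exists, which is one reason the workflow mentions filling NA values as a later processing step.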
Data preparation
- Automatic update of daily frequency data
- Converting CSV Format into Qlib Format
# for daily data
python scripts/get_data.py download_data --file_name csv_data_cn.zip --target_dir ~/.qlib/csv_data/cn_data
# for 1 min data
python scripts/data_collector/yahoo/collector.py download_data --source_dir ~/.qlib/stock_data/source/cn_1min --region CN --start 2021-05-20 --end 2021-05-23 --delay 0.1 --interval 1min --limit_nums 10
Data API
- Retrieval
- Feature
- Filter
- NameDFilter
- ExpressionDFilter
- basic features filter: rule_expression = '$close/$open>5'
- cross-sectional features filter: rule_expression = '$rank($close)<10'
- time-sequence features filter: rule_expression = '$Ref($close, 3)>100'
filter: &filter
    filter_type: ExpressionDFilter
    rule_expression: "Ref($close, -2) / Ref($close, -1) > 1"
    filter_start_time: 2010-01-01
    filter_end_time: 2010-01-07
    keep: False

data_handler_config: &data_handler_config
    start_time: 2010-01-01
    end_time: 2021-01-22
    fit_start_time: 2010-01-01
    fit_end_time: 2015-12-31
    instruments: *market
    filter_pipe: [*filter]
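The two filter kinds can be illustrated with a small plain-Python sketch. This is not Qlib's implementation: `name_filter` and `expression_filter` are hypothetical helpers, the instrument codes and prices are made up, and a Python callable stands in for the rule_expression string. The expression filter returns day indices rather than a single yes/no because a rule on prices can hold on some days and fail on others.

```python
import re
from typing import Callable, Dict, List


def name_filter(instruments: List[str], name_rule_re: str) -> List[str]:
    """Keep instruments whose code matches the regular expression from its start."""
    pattern = re.compile(name_rule_re)
    return [inst for inst in instruments if pattern.match(inst)]


def expression_filter(
    closes: Dict[str, List[float]], rule: Callable[[float], bool]
) -> Dict[str, List[int]]:
    """For each instrument, list the day indices on which the rule holds."""
    return {
        inst: [day for day, price in enumerate(series) if rule(price)]
        for inst, series in closes.items()
    }


# Made-up closing prices for two instruments over three days.
closes = {"SH600000": [4.0, 6.0, 7.0], "SZ000001": [3.0, 2.0, 8.0]}

by_name = name_filter(list(closes), r"SH\d{6}")
by_rule = expression_filter(closes, lambda price: price > 5)
```

Chaining the two, as filter_pipe does with a list of filters, would amount to applying the second filter only to the instruments that survived the first.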
Data Loader
Data Loader in Qlib is designed to load raw data from the original data source. It will be loaded and used in the Data Handler module.
QlibDataLoader
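from qlib.data.dataset.loader import QlibDataLoader

# Assumes qlib.init(...) has already been called with a valid data provider.
qdl = QlibDataLoader(config=["$close / Ref($close, 5)", "$close", "Ref($close, 5)"])
qdl.load(instruments=["sh600519"], start_time="20190101", end_time="20191231")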
To learn more about Data Loader, please refer to Data Loader API.
Data Handler
- The Data Handler module in Qlib is designed to handle the common data processing methods that most models use.
- Users can use Data Handler in an automatic workflow via qrun; refer to Workflow: Workflow Management for more details.
DataHandlerLP (DataHandler with (L)earnable (P)rocessor)
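from qlib.data.dataset.handler import DataHandlerLP
from qlib.data.dataset.processor import ZScoreNorm, Fillna

# qdl is the QlibDataLoader instance created in the Data Loader section.
dh = DataHandlerLP(
    instruments=["sh600519"],
    start_time="20170101",
    end_time="20191231",
    infer_processors=[
        ZScoreNorm(fit_start_time="20170101", fit_end_time="20181231"),
        Fillna(),
    ],
    data_loader=qdl,
)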
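The idea behind a learnable processor, as suggested by the fit_start_time and fit_end_time arguments above, is that parameters are learned on a fit window and then applied to the whole range. The following plain-Python sketch shows that split for z-score normalization. It is an illustration only: `fit_zscore` and `apply_zscore` are hypothetical names, the data is made up, and using the population standard deviation is a choice of this sketch, not a statement about Qlib's ZScoreNorm.

```python
from statistics import mean, pstdev
from typing import Dict, Tuple


def fit_zscore(series: Dict[str, float], fit_start: str, fit_end: str) -> Tuple[float, float]:
    """Learn mean and population standard deviation from the fit window only."""
    # YYYYMMDD strings compare correctly as plain strings.
    window = [value for day, value in series.items() if fit_start <= day <= fit_end]
    return mean(window), pstdev(window)


def apply_zscore(series: Dict[str, float], mu: float, sigma: float) -> Dict[str, float]:
    """Normalize every value, including those outside the fit window."""
    return {day: (value - mu) / sigma for day, value in series.items()}


series = {"20170103": 1.0, "20180104": 3.0, "20190102": 5.0}
mu, sigma = fit_zscore(series, "20170101", "20181231")
normalized = apply_zscore(series, mu, sigma)
```

Keeping the fit window separate from the application range means values from 2019 never influence the statistics used to normalize them, which avoids leaking later data into the features.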
Processor
- DropnaProcessor: a processor that drops N/A features.
- DropnaLabel: a processor that drops N/A labels.
- TanhProcess: a processor that uses tanh to process noisy data.
- ProcessInf: a processor that handles infinite values by replacing them with the mean of the column.
- Fillna: a processor that fills N/A values with 0 or another given number.
- MinMaxNorm: a processor that applies min-max normalization.
- ZScoreNorm: a processor that applies z-score normalization.
- RobustZScoreNorm: a processor that applies robust z-score normalization.
- CSZScoreNorm: a processor that applies cross-sectional z-score normalization.
- CSRankNorm: a processor that applies cross-sectional rank normalization.
- CSZFillna: a processor that fills N/A values in a cross-sectional way with the mean of the column.
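The "CS" processors in the list differ from the others in the axis they work along: statistics are computed across instruments on the same date rather than along time. The sketch below shows a cross-sectional z-score in plain Python. `cs_zscore` is a hypothetical helper and the rows are made up; it is not Qlib's CSZScoreNorm.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Row = Tuple[str, str, float]  # (date, instrument, value)


def cs_zscore(rows: List[Row]) -> List[Row]:
    """Normalize each value against the values of all instruments on the same date."""
    by_date: Dict[str, List[float]] = defaultdict(list)
    for date, _, value in rows:
        by_date[date].append(value)
    # One (mean, std) pair per date; a date whose values are all equal
    # would give std 0 and is not handled in this sketch.
    stats = {date: (mean(vals), pstdev(vals)) for date, vals in by_date.items()}
    return [(date, inst, (value - stats[date][0]) / stats[date][1]) for date, inst, value in rows]


rows = [
    ("2020-01-02", "A", 1.0),
    ("2020-01-02", "B", 3.0),
    ("2020-01-03", "A", 10.0),
    ("2020-01-03", "B", 30.0),
]
scores = cs_zscore(rows)
```

Although the raw values on the second date are ten times larger, both dates produce the same scores, because each date is normalized on its own.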
Users can also define their own processors by inheriting the base class Processor. See Processor API for more details.
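As a rough picture of what defining a processor by inheritance looks like, the sketch below uses a hypothetical stand-in base class, `SketchProcessor`, with a `fit` method that learns parameters and a `__call__` method that applies them. The stand-in works on lists of floats and is an assumption made for illustration. The real Processor base class and its exact interface are described in the Processor API.

```python
from typing import List, Optional


class SketchProcessor:
    """Hypothetical stand-in for a processor base class (not Qlib's Processor)."""

    def fit(self, data: List[float]) -> None:
        """Learn parameters from data; the default learns nothing."""

    def __call__(self, data: List[float]) -> List[float]:
        raise NotImplementedError


class ClipToFitRange(SketchProcessor):
    """Clip values to the [min, max] range observed during fit."""

    def __init__(self) -> None:
        self.low: Optional[float] = None
        self.high: Optional[float] = None

    def fit(self, data: List[float]) -> None:
        self.low, self.high = min(data), max(data)

    def __call__(self, data: List[float]) -> List[float]:
        if self.low is None or self.high is None:
            raise RuntimeError("fit must be called before the processor is applied")
        return [min(max(value, self.low), self.high) for value in data]


clip = ClipToFitRange()
clip.fit([1.0, 2.0, 3.0])
clipped = clip([0.0, 2.5, 9.0])
```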
Dataset
The Dataset module in Qlib aims to prepare data for model training and inference.
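One way to picture preparing data for training and inference is splitting processed rows into named date segments. The sketch below does this in plain Python. The `prepare` function, the `segments` mapping, and the rows are illustrative choices for this sketch and do not use Qlib's Dataset classes.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, float]  # (ISO date, feature value)


def prepare(rows: List[Row], segments: Dict[str, Tuple[str, str]], name: str) -> List[Row]:
    """Return the rows whose date falls inside the named segment (inclusive)."""
    start, end = segments[name]
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    return [row for row in rows if start <= row[0] <= end]


segments = {
    "train": ("2017-01-01", "2018-12-31"),
    "test": ("2019-01-01", "2019-12-31"),
}
rows = [("2017-06-01", 1.0), ("2018-06-01", 2.0), ("2019-06-01", 3.0)]

train_rows = prepare(rows, segments, "train")
test_rows = prepare(rows, segments, "test")
```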
Cache
The Cache is an optional module that helps accelerate providing data by saving some frequently-used data as a cache file. Qlib provides a Memcache class to cache the most frequently used data in memory, an inheritable ExpressionCache class, and an inheritable DatasetCache class.
- Global Memory Cache
- ExpressionCache
- DatasetCache
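The in-memory caching idea can be sketched with a small bounded cache that keeps recently used results and recomputes the rest. `LRUCache` below is a hypothetical class written for illustration, and the least-recently-used eviction policy is an assumption of this sketch, not a description of Qlib's Memcache.

```python
from collections import OrderedDict
from typing import Any, Callable, List


class LRUCache:
    """Bounded in-memory cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._store: "OrderedDict[str, Any]" = OrderedDict()

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        value = compute()
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # drop the least recently used entry
        return value


loads: List[str] = []


def expensive_load(key: str) -> str:
    """Stand-in for an expensive data load; records each real computation."""
    loads.append(key)
    return key.upper()


cache = LRUCache(capacity=2)
for key in ["a", "b", "a", "c", "b"]:
    cache.get_or_compute(key, lambda k=key: expensive_load(k))
```

In this run the second request for "a" is served from memory, while "b" is loaded twice because it was evicted when "c" arrived.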
Data and Cache File Structure
See the paper for more details.
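For readers who want a feel for a flat binary feature file, the sketch below writes and reads one under an assumed layout: a sequence of little-endian float32 values whose first element is the index of the first covered day in the trading calendar, followed by one value per day. This layout, the helper names `write_feature` and `read_feature`, and the file name are assumptions of the sketch and should be checked against the Qlib version in use.

```python
import os
import struct
import tempfile
from typing import List, Tuple


def write_feature(path: str, start_index: int, values: List[float]) -> None:
    """Write the start index followed by the values as little-endian float32."""
    payload = struct.pack(f"<{len(values) + 1}f", float(start_index), *values)
    with open(path, "wb") as fp:
        fp.write(payload)


def read_feature(path: str) -> Tuple[int, List[float]]:
    """Read the file back into (start_index, values)."""
    with open(path, "rb") as fp:
        raw = fp.read()
    count = len(raw) // 4  # each float32 takes 4 bytes
    numbers = struct.unpack(f"<{count}f", raw)
    return int(numbers[0]), list(numbers[1:])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "close.day.bin")  # illustrative file name
    write_feature(path, 7, [1.5, 2.25, 3.0])
    file_size = os.path.getsize(path)
    start_index, values = read_feature(path)
```

Storing a start index instead of explicit dates keeps every record the same size, so a reader can jump to any day with simple offset arithmetic.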