Quick Start
Initialization
# get data
python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data --region cn
# initialize qlib
import qlib
# region in [REG_CN, REG_US]
from qlib.constant import REG_CN
provider_uri = "~/.qlib/qlib_data/cn_data" # target_dir
qlib.init(provider_uri=provider_uri, region=REG_CN)
For qlib.init(), you can also use the following parameters:
provider_uri
: the path of the data.

region
:
- Different regions will result in different trading limitations and costs. Currently, qlib.constant.REG_US ("us") and qlib.constant.REG_CN ("cn") are supported.
- The region is just a shortcut for defining a batch of configurations, which include the minimal trading order unit (trade_unit), the trading limitation (limit_threshold), etc. See below for details:

_default_region_config = {
    REG_CN: {
        "trade_unit": 100,
        "limit_threshold": 0.095,
        "deal_price": "close",
    },
    REG_US: {
        "trade_unit": 1,
        "limit_threshold": None,
        "deal_price": "close",
    },
}

- Users can set the key configurations manually if the existing region settings can't meet their requirements, e.g. as sketched below.
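For instance, a minimal sketch, assuming qlib.init forwards extra keyword arguments into the global configuration (the exact override mechanism may differ between versions):

qlib.init(
    provider_uri=provider_uri,
    region=REG_CN,
    trade_unit=200,        # override the REG_CN default of 100
    limit_threshold=0.1,   # override the REG_CN default of 0.095
)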
redis_host and redis_port
: the host and port of the Redis server; Qlib's cache and lock mechanisms rely on Redis.

Caution
Refer to the source code for details.
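For example, a minimal sketch pointing Qlib at a local Redis instance (assuming a Redis server is already running on the default port):

qlib.init(provider_uri=provider_uri, region=REG_CN, redis_host="127.0.0.1", redis_port=6379)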
exp_manager
: specify an experiment manager class, as well as the tracking URI, for all the experiments.

# For example, if you want to set your tracking_uri to a <specific folder>, you can initialize qlib as below
qlib.init(provider_uri=provider_uri, region=REG_CN,
          exp_manager={
              "class": "MLflowExpManager",
              "module_path": "qlib.workflow.expm",
              "kwargs": {
                  "uri": "python_execution_path/mlruns",
                  "default_exp_name": "Experiment",
              }
          })
mongo
:
- Install MongoDB first.
- Access MongoDB with credentials by setting "task_url" to a string like "mongodb://%s:%s@%s" % (user, pwd, host + ":" + port).

# For example, you can initialize qlib as below
qlib.init(provider_uri=provider_uri, region=REG_CN,
          mongo={
              "task_url": "mongodb://localhost:27017/",  # your MongoDB url
              "task_db_name": "rolling_db",              # the database name used by Task Management
          })
logging_level
: the logging level for the system.

kernels
: the number of processes used when calculating features in Qlib's expression engine. It is very helpful to set it to 1 when you are debugging an expression-calculation exception, as sketched below.
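For example, a sketch that turns on verbose logging and restricts the expression engine to a single process for easier debugging:

import logging

qlib.init(provider_uri=provider_uri, region=REG_CN, logging_level=logging.DEBUG, kernels=1)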
Data Retrieval
Initialize
import qlib
qlib.init(provider_uri='~/.qlib/qlib_data/cn_data')
Use D to retrieve data
from qlib.data import D
# get calendar
calendar = D.calendar(start_time='2008-01-01', end_time='2021-01-01', freq='day')
# get instrument
instruments = D.instruments(market='csi300') # or market='all', etc.
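A quick sanity check of the two objects (a sketch with an arbitrary date range; note that D.instruments returns a stock-pool configuration, which D.list_instruments materializes into concrete instrument codes):

print(calendar[:2])  # the first two trading days in the range
stock_list = D.list_instruments(instruments=instruments, start_time='2010-01-01', end_time='2017-12-31', as_list=True)
print(stock_list[:6])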
Filter
NameDFilter
from qlib.data.filter import NameDFilter
nameDFilter = NameDFilter(name_rule_re='SH[0-9]{4}55')
instruments = D.instruments(market='csi300', filter_pipe=[nameDFilter])
D.list_instruments(instruments=instruments, start_time='2015-01-01', end_time='2016-02-15', as_list=True)
ExpressionDFilter
from qlib.data.filter import ExpressionDFilter
expressionDFilter = ExpressionDFilter(rule_expression='$close>2000')
instruments = D.instruments(market='csi300', filter_pipe=[expressionDFilter])
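As with NameDFilter above, the filtered stock pool can then be materialized:

D.list_instruments(instruments=instruments, start_time='2015-01-01', end_time='2016-02-15', as_list=True)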
- See Filter API for more details.
Load features
from qlib.data import D
instruments = ["SH600000"]
fields = ["$close", "$volume", "Ref($close, 1)", "Ref($close, 3)", "$high-$low"]
D.features(instruments, fields, start_time='2015-01-01', end_time='2016-02-15', freq='day').head().to_string()
About the cache when loading features

With the cache enabled, the first request may take longer to process. But after that, requests with the same stock pool and fields will hit the cache and be processed faster, even if the requested time period changes.

When calling D.features() at the client (see the sketch below):
- use disk_cache=0 to skip the dataset cache;
- use disk_cache=1 to generate and use the dataset cache.
When calling at the server, users can use disk_cache=2 to update the dataset cache.
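For example, a sketch of the same feature request with each client-side cache behavior made explicit:

# skip the dataset cache entirely
df = D.features(instruments, fields, start_time='2015-01-01', end_time='2016-02-15', freq='day', disk_cache=0)
# generate the dataset cache on the first call and reuse it afterwards
df = D.features(instruments, fields, start_time='2015-01-01', end_time='2016-02-15', freq='day', disk_cache=1)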
Two ways to load complex features:
# 1. use expression
from qlib.data import D
data = D.features(["sh600519"], ["(($high / $close) + ($open / $close)) * (($high / $close) + ($open / $close)) / (($high / $close) + ($open / $close))"], start_time="20200101")
# 2. use expression ops
from qlib.data.ops import *
f1 = Feature("high") / Feature("close")
f2 = Feature("open") / Feature("close")
f3 = f1 + f2
f4 = f3 * f3 / f3
data = D.features(["sh600519"], [f4], start_time="20200101")
data.head()
Custom Model Integration
Here are 3 steps to integrate your model into Qlib.
- Define your model by inheriting qlib.model.base.Model (a minimal skeleton is sketched after this list).
- Write a configuration file that describes the path and parameters of the custom model.
- Test the custom model.
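Before the full example, here is a minimal skeleton of step 1 (a sketch; the class name and method bodies are placeholders, only the inherited interface comes from Qlib):

from qlib.model.base import Model


class MyCustomModel(Model):
    """A hypothetical model illustrating the interface Qlib expects."""

    def fit(self, dataset):
        # learn parameters from the training portion of `dataset`
        raise NotImplementedError

    def predict(self, dataset):
        # return prediction scores for the test portion of `dataset`
        raise NotImplementedError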
The sample code is as follows:
Override the __init__ method

from sklearn.metrics import mean_squared_error, roc_auc_score  # module-level imports used below

def __init__(self, loss='mse', **kwargs):
    if loss not in {'mse', 'binary'}:
        raise NotImplementedError
    self._scorer = mean_squared_error if loss == 'mse' else roc_auc_score
    self.params = {"objective": loss}  # `self.params` is reused by fit/finetune below
    self.params.update(kwargs)
    self.model = None                  # `self.model` is set by fit and checked by predict
Override the fit method
Must include the parameter dataset, which carries the training features and labels.

# module-level imports used below
import numpy as np
import lightgbm as lgb
from qlib.data.dataset import DatasetH
from qlib.data.dataset.handler import DataHandlerLP

# num_boost_round is the number of boosting iterations; it is exposed as an optional parameter
def fit(self, dataset: DatasetH, num_boost_round=1000, early_stopping_rounds=50,
        verbose_eval=20, evals_result=None, **kwargs):
    if evals_result is None:
        evals_result = {}
    # prepare dataset for lgb training and evaluation
    df_train, df_valid = dataset.prepare(
        ["train", "valid"], col_set=["feature", "label"], data_key=DataHandlerLP.DK_L
    )
    x_train, y_train = df_train["feature"], df_train["label"]
    x_valid, y_valid = df_valid["feature"], df_valid["label"]
    # LightGBM needs a 1D array as its label
    if y_train.values.ndim == 2 and y_train.values.shape[1] == 1:
        y_train, y_valid = np.squeeze(y_train.values), np.squeeze(y_valid.values)
    else:
        raise ValueError("LightGBM doesn't support multi-label training")
    dtrain = lgb.Dataset(x_train.values, label=y_train)
    dvalid = lgb.Dataset(x_valid.values, label=y_valid)
    # fit the model
    # note: LightGBM >= 4.0 passes the three arguments below via callbacks
    # (lgb.early_stopping, lgb.log_evaluation, lgb.record_evaluation) instead
    self.model = lgb.train(
        self.params,
        dtrain,
        num_boost_round=num_boost_round,
        valid_sets=[dtrain, dvalid],
        valid_names=["train", "valid"],
        early_stopping_rounds=early_stopping_rounds,
        verbose_eval=verbose_eval,
        evals_result=evals_result,
        **kwargs
    )
Override the predict method
Must include the parameter dataset, which will be used to get the test dataset.

import pandas as pd  # module-level import used below

def predict(self, dataset: DatasetH, **kwargs) -> pd.Series:
    if self.model is None:
        raise ValueError("model is not fitted yet!")
    x_test = dataset.prepare("test", col_set="feature", data_key=DataHandlerLP.DK_I)
    # return the prediction score as a pandas Series aligned with the test index
    return pd.Series(self.model.predict(x_test.values), index=x_test.index)
Override the finetune method (Optional)
- Must include the parameter dataset.
- The model should inherit the ModelFT base class, which defines the finetune interface.

def finetune(self, dataset: DatasetH, num_boost_round=10, verbose_eval=20):
    # finetune by training the existing model for a few more rounds
    dtrain, _ = self._prepare_data(dataset)  # helper that builds the lgb.Dataset objects
    self.model = lgb.train(
        self.params,
        dtrain,
        num_boost_round=num_boost_round,
        init_model=self.model,
        valid_sets=[dtrain],
        valid_names=["train"],
        verbose_eval=verbose_eval,
    )
Configuration file
model:
    class: LGBModel
    module_path: qlib.contrib.model.gbdt
    kwargs:
        loss: mse
        colsample_bytree: 0.8879
        learning_rate: 0.0421
        subsample: 0.8789
        lambda_l1: 205.6999
        lambda_l2: 580.9768
        max_depth: 8
        num_leaves: 210
        num_threads: 20
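To test the custom model (step 3), one option is to instantiate it from the same configuration programmatically; a sketch, assuming the dict-based form uses the kwargs key accepted by qlib.utils.init_instance_by_config, and that a DatasetH named dataset has been prepared elsewhere:

from qlib.utils import init_instance_by_config

model_config = {
    "class": "LGBModel",
    "module_path": "qlib.contrib.model.gbdt",
    "kwargs": {"loss": "mse", "learning_rate": 0.0421, "num_leaves": 210},
}
model = init_instance_by_config(model_config)
# model.fit(dataset)            # `dataset` is a DatasetH prepared elsewhere
# pred = model.predict(dataset) # prediction scores as a pandas Series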