Quick Start
Initialization
# get data
python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data --region cn
# initialize qlib
import qlib
# region in [REG_CN, REG_US]
from qlib.constant import REG_CN
provider_uri = "~/.qlib/qlib_data/cn_data" # target_dir
qlib.init(provider_uri=provider_uri, region=REG_CN)
For qlib.init(), you can also use the following parameters:
provider_uri
: the path of the data.

region
:
- Different regions will result in different trading limitations and costs. Currently, qlib.constant.REG_US ("us") and qlib.constant.REG_CN ("cn") are supported.
- The region is just a shortcut for defining a batch of configurations, which include the minimal trading order unit (trade_unit), the trading limitation (limit_threshold), etc. See below for details:

_default_region_config = {
    REG_CN: {
        "trade_unit": 100,
        "limit_threshold": 0.095,
        "deal_price": "close",
    },
    REG_US: {
        "trade_unit": 1,
        "limit_threshold": None,
        "deal_price": "close",
    },
}

- Users can set the key configurations manually if the existing region settings can't meet their requirements, e.g. as sketched below.
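For instance, a minimal sketch, assuming qlib.init forwards extra keyword arguments into the global configuration (the exact override mechanism may differ between versions):

qlib.init(
    provider_uri=provider_uri,
    region=REG_CN,
    trade_unit=200,        # override the REG_CN default of 100
    limit_threshold=0.1,   # override the REG_CN default of 0.095
)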
redis_host and redis_port
: the host and port of the Redis server; Qlib's cache and lock mechanisms rely on Redis.

Caution
Refer to the source code for details.
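For example, a minimal sketch pointing Qlib at a local Redis instance (assuming a Redis server is already running on the default port):

qlib.init(provider_uri=provider_uri, region=REG_CN, redis_host="127.0.0.1", redis_port=6379)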
exp_manager
: specify an experiment manager class, as well as the tracking URI, for all the experiments.

# For example, if you want to set your tracking_uri to a <specific folder>, you can initialize qlib as below
qlib.init(provider_uri=provider_uri, region=REG_CN,
          exp_manager={
              "class": "MLflowExpManager",
              "module_path": "qlib.workflow.expm",
              "kwargs": {
                  "uri": "python_execution_path/mlruns",
                  "default_exp_name": "Experiment",
              }
          })
mongo
:
- Install MongoDB first.
- Access MongoDB with credentials by setting "task_url" to a string like "mongodb://%s:%s@%s" % (user, pwd, host + ":" + port).

# For example, you can initialize qlib as below
qlib.init(provider_uri=provider_uri, region=REG_CN,
          mongo={
              "task_url": "mongodb://localhost:27017/",  # your MongoDB url
              "task_db_name": "rolling_db",              # the database name used by Task Management
          })
logging_level
: the logging level for the system.

kernels
: the number of processes used when calculating features in Qlib's expression engine. It is very helpful to set it to 1 when you are debugging an expression-calculation exception, as sketched below.
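For example, a sketch that turns on verbose logging and restricts the expression engine to a single process for easier debugging:

import logging

qlib.init(provider_uri=provider_uri, region=REG_CN, logging_level=logging.DEBUG, kernels=1)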
Data Retrieval
Initialize
import qlib
qlib.init(provider_uri='~/.qlib/qlib_data/cn_data')
Use D to retrieve data
from qlib.data import D
# get calendar
calendar = D.calendar(start_time='2008-01-01', end_time='2021-01-01', freq='day')
# get instrument
instruments = D.instruments(market='csi300') # or market='all', etc.
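A quick sanity check of the two objects (a sketch with an arbitrary date range; note that D.instruments returns a stock-pool configuration, which D.list_instruments materializes into concrete instrument codes):

print(calendar[:2])  # the first two trading days in the range
stock_list = D.list_instruments(instruments=instruments, start_time='2010-01-01', end_time='2017-12-31', as_list=True)
print(stock_list[:6])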
Filter
NameDFilter
from qlib.data.filter import NameDFilter
nameDFilter = NameDFilter(name_rule_re='SH[0-9]{4}55')
instruments = D.instruments(market='csi300', filter_pipe=[nameDFilter])
D.list_instruments(instruments=instruments, start_time='2015-01-01', end_time='2016-02-15', as_list=True)
ExpressionDFilter
from qlib.data.filter import ExpressionDFilter
expressionDFilter = ExpressionDFilter(rule_expression='$close>2000')
instruments = D.instruments(market='csi300', filter_pipe=[expressionDFilter])
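As with NameDFilter above, the filtered stock pool can then be materialized:

D.list_instruments(instruments=instruments, start_time='2015-01-01', end_time='2016-02-15', as_list=True)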
- See Filter API for more details.
Load features
from qlib.data import D
instruments = ["SH600000"]
fields = ["$close", "$volume", "Ref($close, 1)", "Ref($close, 3)", "$high-$low"]
D.features(instruments, fields, start_time='2015-01-01', end_time='2016-02-15', freq='day').head().to_string()
About the cache when loading features

With the cache enabled, the first request may take longer to process. But after that, requests with the same stock pool and fields will hit the cache and be processed faster, even if the requested time period changes.

When calling D.features() at the client (see the sketch below):
- use disk_cache=0 to skip the dataset cache;
- use disk_cache=1 to generate and use the dataset cache.
When calling at the server, users can use disk_cache=2 to update the dataset cache.
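For example, a sketch of the same feature request with each client-side cache behavior made explicit:

# skip the dataset cache entirely
df = D.features(instruments, fields, start_time='2015-01-01', end_time='2016-02-15', freq='day', disk_cache=0)
# generate the dataset cache on the first call and reuse it afterwards
df = D.features(instruments, fields, start_time='2015-01-01', end_time='2016-02-15', freq='day', disk_cache=1)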
Two ways to load complex features:
# 1. use expression
from qlib.data import D
data = D.features(["sh600519"], ["(($high / $close) + ($open / $close)) * (($high / $close) + ($open / $close)) / (($high / $close) + ($open / $close))"], start_time="20200101")
# 2. use expression ops
from qlib.data.ops import *
f1 = Feature("high") / Feature("close")
f2 = Feature("open") / Feature("close")
f3 = f1 + f2
f4 = f3 * f3 / f3
data = D.features(["sh600519"], [f4], start_time="20200101")
data.head()
Custom Model Integration
Here are 3 steps to integrate your model into Qlib.
- Define your model by inheriting qlib.model.base.Model (a minimal skeleton is sketched after this list).
- Write a configuration file that describes the path and parameters of the custom model.
- Test the custom model.
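Before the full example, here is a minimal skeleton of step 1 (a sketch; the class name and method bodies are placeholders, only the inherited interface comes from Qlib):

from qlib.model.base import Model


class MyCustomModel(Model):
    """A hypothetical model illustrating the interface Qlib expects."""

    def fit(self, dataset):
        # learn parameters from the training portion of `dataset`
        raise NotImplementedError

    def predict(self, dataset):
        # return prediction scores for the test portion of `dataset`
        raise NotImplementedError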
The sample code is as follows:
Override the __init__ method

from sklearn.metrics import mean_squared_error, roc_auc_score  # module-level imports used below

def __init__(self, loss='mse', **kwargs):
    if loss not in {'mse', 'binary'}:
        raise NotImplementedError
    self._scorer = mean_squared_error if loss == 'mse' else roc_auc_score
    self.params = {"objective": loss}  # `self.params` is reused by fit/finetune below
    self.params.update(kwargs)
    self.model = None                  # `self.model` is set by fit and checked by predict
Override the fit method
Must include the parameter dataset, which carries the training features and labels.

# module-level imports used below
import numpy as np
import lightgbm as lgb
from qlib.data.dataset import DatasetH
from qlib.data.dataset.handler import DataHandlerLP

# num_boost_round is the number of boosting iterations; it is exposed as an optional parameter
def fit(self, dataset: DatasetH, num_boost_round=1000, early_stopping_rounds=50,
        verbose_eval=20, evals_result=None, **kwargs):
    if evals_result is None:
        evals_result = {}
    # prepare dataset for lgb training and evaluation
    df_train, df_valid = dataset.prepare(
        ["train", "valid"], col_set=["feature", "label"], data_key=DataHandlerLP.DK_L
    )
    x_train, y_train = df_train["feature"], df_train["label"]
    x_valid, y_valid = df_valid["feature"], df_valid["label"]
    # LightGBM needs a 1D array as its label
    if y_train.values.ndim == 2 and y_train.values.shape[1] == 1:
        y_train, y_valid = np.squeeze(y_train.values), np.squeeze(y_valid.values)
    else:
        raise ValueError("LightGBM doesn't support multi-label training")
    dtrain = lgb.Dataset(x_train.values, label=y_train)
    dvalid = lgb.Dataset(x_valid.values, label=y_valid)
    # fit the model
    # note: LightGBM >= 4.0 passes the three arguments below via callbacks
    # (lgb.early_stopping, lgb.log_evaluation, lgb.record_evaluation) instead
    self.model = lgb.train(
        self.params,
        dtrain,
        num_boost_round=num_boost_round,
        valid_sets=[dtrain, dvalid],
        valid_names=["train", "valid"],
        early_stopping_rounds=early_stopping_rounds,
        verbose_eval=verbose_eval,
        evals_result=evals_result,
        **kwargs
    )
Override the predict method
Must include the parameter dataset, which will be used to get the test dataset.

import pandas as pd  # module-level import used below

def predict(self, dataset: DatasetH, **kwargs) -> pd.Series:
    if self.model is None:
        raise ValueError("model is not fitted yet!")
    x_test = dataset.prepare("test", col_set="feature", data_key=DataHandlerLP.DK_I)
    # return the prediction score as a pandas Series aligned with the test index
    return pd.Series(self.model.predict(x_test.values), index=x_test.index)
Override the finetune method (Optional)
- Must include the parameter dataset.
- The model should inherit the ModelFT base class, which defines the finetune interface.

def finetune(self, dataset: DatasetH, num_boost_round=10, verbose_eval=20):
    # finetune by training the existing model for a few more rounds
    dtrain, _ = self._prepare_data(dataset)  # helper that builds the lgb.Dataset objects
    self.model = lgb.train(
        self.params,
        dtrain,
        num_boost_round=num_boost_round,
        init_model=self.model,
        valid_sets=[dtrain],
        valid_names=["train"],
        verbose_eval=verbose_eval,
    )
Configuration file
model:
    class: LGBModel
    module_path: qlib.contrib.model.gbdt
    kwargs:
        loss: mse
        colsample_bytree: 0.8879
        learning_rate: 0.0421
        subsample: 0.8789
        lambda_l1: 205.6999
        lambda_l2: 580.9768
        max_depth: 8
        num_leaves: 210
        num_threads: 20
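To test the custom model (step 3), one option is to instantiate it from the same configuration programmatically; a sketch, assuming the dict-based form uses the kwargs key accepted by qlib.utils.init_instance_by_config, and that a DatasetH named dataset has been prepared elsewhere:

from qlib.utils import init_instance_by_config

model_config = {
    "class": "LGBModel",
    "module_path": "qlib.contrib.model.gbdt",
    "kwargs": {"loss": "mse", "learning_rate": 0.0421, "num_leaves": 210},
}
model = init_instance_by_config(model_config)
# model.fit(dataset)            # `dataset` is a DatasetH prepared elsewhere
# pred = model.predict(dataset) # prediction scores as a pandas Series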