Quick Start


Initialization

# get data
python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data --region cn

# initialize qlib
import qlib
# region in [REG_CN, REG_US]
from qlib.constant import REG_CN
provider_uri = "~/.qlib/qlib_data/cn_data"  # target_dir
qlib.init(provider_uri=provider_uri, region=REG_CN)

For qlib.init(), you can also use the following parameters:

  • provider_uri: the path of the data
  • region:
    • Different regions imply different trading limitations and costs. Currently, qlib.constant.REG_US ('us') and qlib.constant.REG_CN ('cn') are supported.
    • The region is just a shortcut for defining a batch of configurations, including the minimal trading order unit (trade_unit), the trading limitation (limit_threshold), etc. See the defaults below.
      _default_region_config = {
          REG_CN: {
              "trade_unit": 100,
              "limit_threshold": 0.095,
              "deal_price": "close",
          },
          REG_US: {
              "trade_unit": 1,
              "limit_threshold": None,
              "deal_price": "close",
          },
      }
      
    • Users can set the key configurations manually if the existing region setting can’t meet their requirements.
  • redis_host and redis_port

    Caution

    These parameters configure the host and port of the Redis server used by Qlib's cache system; read the source code for details.

  • exp_manager: specify an experiment manager class, as well as the tracking URI for all the experiments.
    # For example, to set your tracking_uri to a <specific folder>, initialize Qlib as follows:
    qlib.init(provider_uri=provider_uri, region=REG_CN, exp_manager={
        "class": "MLflowExpManager",
        "module_path": "qlib.workflow.expm",
        "kwargs": {
            "uri": "python_execution_path/mlruns",
            "default_exp_name": "Experiment",
        }
    })
    
  • mongo:
    • Install MongoDB
    • Access MongoDB with credentials by setting "task_url" to a string like "mongodb://%s:%s@%s" % (user, pwd, host + ":" + port).
    # For example, you can initialize Qlib as follows:
    qlib.init(provider_uri=provider_uri, region=REG_CN, mongo={
        "task_url": "mongodb://localhost:27017/",  # your mongo url
        "task_db_name": "rolling_db", # the database name of Task Management
    })
    
  • logging_level: the logging level of Qlib's loggers.
  • kernels: the number of processes used when calculating features in Qlib's expression engine. Setting it to 1 is very helpful when debugging an exception in expression calculation. A sketch combining several of these optional parameters follows this list.
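For illustration, here is a minimal sketch of an init call combining several of these optional parameters; the Redis address, kernels, and logging values are placeholder assumptions, not recommended settings:

import logging

import qlib
from qlib.constant import REG_CN

# all values below are illustrative assumptions
qlib.init(
    provider_uri="~/.qlib/qlib_data/cn_data",
    region=REG_CN,
    redis_host="127.0.0.1",       # Redis server used by Qlib's cache system
    redis_port=6379,
    kernels=1,                    # a single process eases debugging of expression errors
    logging_level=logging.INFO,
)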

Data Retrieval

Initialize

import qlib
qlib.init(provider_uri='~/.qlib/qlib_data/cn_data')

Use D to retrieve data

from qlib.data import D

# get calendar
calendar = D.calendar(start_time='2008-01-01', end_time='2021-01-01', freq='day')

# get instrument
instruments = D.instruments(market='csi300') # or market='all', etc.
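To sanity-check the result, you can print a slice of the calendar; the exact timestamps depend on your local data:

print(calendar[:2])
# e.g. [Timestamp('2008-01-02 00:00:00'), Timestamp('2008-01-03 00:00:00')]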

Filter

  • NameDFilter: filters instruments by a name rule (regular expression)
from qlib.data.filter import NameDFilter
nameDFilter = NameDFilter(name_rule_re='SH[0-9]{4}55')
instruments = D.instruments(market='csi300', filter_pipe=[nameDFilter])
D.list_instruments(instruments=instruments, start_time='2015-01-01', end_time='2016-02-15', as_list=True)
  • ExpressionDFilter: filters instruments by a feature expression
from qlib.data.filter import ExpressionDFilter
expressionDFilter = ExpressionDFilter(rule_expression='$close>2000')
instruments = D.instruments(market='csi300', filter_pipe=[expressionDFilter])
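Multiple filters can be combined by passing them together in filter_pipe; a minimal sketch reusing the two filters defined above:

instruments = D.instruments(market='csi300', filter_pipe=[nameDFilter, expressionDFilter])
D.list_instruments(instruments=instruments, start_time='2015-01-01', end_time='2016-02-15', as_list=True)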

Load features

from qlib.data import D
instruments = ["SH600000"]
fields = ["$close", "$volume", "Ref($close, 1)", "Ref($close, 3)", "$high-$low"]
D.features(instruments, fields, start_time='2015-01-01', end_time='2016-02-15', freq='day').head().to_string()

About cache when loading features

With the cache enabled, the first request may take longer to process. Subsequent requests with the same stock pool and fields will hit the cache and be served faster, even if the requested time period changes.

When calling D.features() at the client,

  • use disk_cache=0 to skip the dataset cache
  • use disk_cache=1 to generate and use the dataset cache
  • when calling the server, use disk_cache=2 to update the dataset cache
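For example, a one-off query that bypasses the dataset cache might look like this (reusing the instruments and fields defined above):

data = D.features(instruments, fields, start_time='2015-01-01', end_time='2016-02-15', freq='day', disk_cache=0)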

Two ways to load complex features:

# 1. use expression
from qlib.data import D
data = D.features(["sh600519"], ["(($high / $close) + ($open / $close)) * (($high / $close) + ($open / $close)) / (($high / $close) + ($open / $close))"], start_time="20200101")

# 2. use expression ops
from qlib.data.ops import *
f1 = Feature("high") / Feature("close")
f2 = Feature("open") / Feature("close")
f3 = f1 + f2
f4 = f3 * f3 / f3
data = D.features(["sh600519"], [f4], start_time="20200101")
data.head()

Custom Model Integration

To integrate a custom model into Qlib, define a model class that inherits qlib.model.base.Model (or ModelFT when finetuning is needed) and override the methods below; the example here is based on a LightGBM model.

The sample code is as follows:
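The snippets below assume the following imports; DatasetH and DataHandlerLP come from Qlib, while LightGBM and scikit-learn provide the learner and the scorers:

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import mean_squared_error, roc_auc_score

from qlib.data.dataset import DatasetH
from qlib.data.dataset.handler import DataHandlerLP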

Override __init__

def __init__(self, loss='mse', **kwargs):
    if loss not in {'mse', 'binary'}:
        raise NotImplementedError
    self._scorer = mean_squared_error if loss == 'mse' else roc_auc_score
    # keep the attribute names consistent with fit() and predict() below
    self.params = {"objective": loss}
    self.params.update(kwargs)
    self.model = None

Override the fit method

The signature must include the parameter dataset, which provides the training (and validation) data.

# num_boost_round is the number of boosting iterations; it is given a default value here
def fit(self, dataset: DatasetH, num_boost_round=1000, early_stopping_rounds=50, verbose_eval=20, evals_result=None, **kwargs):
    evals_result = {} if evals_result is None else evals_result

    # prepare dataset for lgb training and evaluation
    df_train, df_valid = dataset.prepare(
        ["train", "valid"], col_set=["feature", "label"], data_key=DataHandlerLP.DK_L
    )
    x_train, y_train = df_train["feature"], df_train["label"]
    x_valid, y_valid = df_valid["feature"], df_valid["label"]

    # LightGBM needs a 1D array as its label
    if y_train.values.ndim == 2 and y_train.values.shape[1] == 1:
        y_train, y_valid = np.squeeze(y_train.values), np.squeeze(y_valid.values)
    else:
        raise ValueError("LightGBM doesn't support multi-label training")

    dtrain = lgb.Dataset(x_train.values, label=y_train)
    dvalid = lgb.Dataset(x_valid.values, label=y_valid)

    # fit the model
    self.model = lgb.train(
        self.params,
        dtrain,
        num_boost_round=num_boost_round,
        valid_sets=[dtrain, dvalid],
        valid_names=["train", "valid"],
        early_stopping_rounds=early_stopping_rounds,
        verbose_eval=verbose_eval,
        evals_result=evals_result,
        **kwargs
    )

Override the predict method

Must include the parameter dataset, which will be used to get the test dataset.

def predict(self, dataset: DatasetH, **kwargs) -> pd.Series:
    if self.model is None:
        raise ValueError("model is not fitted yet!")
    x_test = dataset.prepare("test", col_set="feature", data_key=DataHandlerLP.DK_I)

    # Return the prediction score.
    return pd.Series(self.model.predict(x_test.values), index=x_test.index)
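With fit and predict in place, a minimal usage sketch looks like the following; MyLGBModel is a hypothetical name for the class holding these overrides, and dataset is assumed to be a configured DatasetH:

model = MyLGBModel(loss='mse')  # hypothetical class containing the overrides above
model.fit(dataset)              # dataset: a configured DatasetH instance
pred = model.predict(dataset)   # a pandas.Series of prediction scores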

Override the finetune method (Optional)

  • Must include the parameter dataset.
  • Should inherit the ModelFT base class
def finetune(self, dataset: DatasetH, num_boost_round=10, verbose_eval=20):
    # finetune by continuing to train the existing model for more rounds
    dtrain, _ = self._prepare_data(dataset)  # helper building lgb.Dataset objects, defined elsewhere in the class
    self.model = lgb.train(
        self.params,
        dtrain,
        num_boost_round=num_boost_round,
        init_model=self.model,
        valid_sets=[dtrain],
        valid_names=["train"],
        verbose_eval=verbose_eval,
    )

Configuration file

model:
    class: LGBModel
    module_path: qlib.contrib.model.gbdt
    kwargs:
        loss: mse
        colsample_bytree: 0.8879
        learning_rate: 0.0421
        subsample: 0.8789
        lambda_l1: 205.6999
        lambda_l2: 580.9768
        max_depth: 8
        num_leaves: 210
        num_threads: 20
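Once written, the configuration can be consumed by Qlib's workflow tools; below is a minimal sketch using qlib.utils.init_instance_by_config, assuming the YAML above is saved as workflow_config.yaml:

import yaml
from qlib.utils import init_instance_by_config

# assumes the YAML above was saved as workflow_config.yaml
with open("workflow_config.yaml") as f:
    config = yaml.safe_load(f)

model = init_instance_by_config(config["model"])  # builds LGBModel(loss='mse', ...)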