
DDG-DA


Introduction

Due to the non-stationary nature of real-world environments, the data distribution can keep changing as data streams in. This phenomenon is called concept drift (Lu et al. 2018), where the basic assumption is that concept drift happens unexpectedly and is unpredictable for streaming data.

To handle concept drift, previous studies usually take a two-step approach (a toy sketch follows this list).

  • Detect the concept drift.
  • Adapt the model to the new data distribution.
    • Retrain the model
    • Fine-tune the model
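A toy sketch of this detect-then-adapt loop; the drift test, the linear model, and the threshold are illustrative assumptions, not from the paper:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def detect_then_adapt(X, y, window=200, threshold=2.0):
    """Toy two-step loop: (1) flag drift when the error jumps, (2) retrain."""
    model = LinearRegression().fit(X[:window], y[:window])
    base_err = np.mean((model.predict(X[:window]) - y[:window]) ** 2)
    preds = []
    for t in range(window, len(X)):
        pred = model.predict(X[t:t + 1])[0]
        preds.append(pred)
        # Step 1: detect -- a crude test comparing the current squared error
        # against the training-time baseline.
        if (pred - y[t]) ** 2 > threshold * base_err:
            # Step 2: adapt -- here by full retraining on the latest window
            # (fine-tuning would update the existing model instead).
            model = LinearRegression().fit(X[t - window:t], y[t - window:t])
            base_err = np.mean((model.predict(X[t - window:t]) - y[t - window:t]) ** 2)
    return np.array(preds)
```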

Assumption

The latest data contains more useful information than the previous data.

Existing methods handle concept drift on the latest arrived data $Data^{(t)}$ at timestamp $t$ and adapt the forecasting model accordingly. But the concept drift continues, and the model adapted on $Data^{(t)}$ will be used on unseen streaming data in the future (e.g., $Data^{(t+1)}$). Such model adaptation therefore has a one-step delay with respect to the concept drift of upcoming streaming data: a new concept drift may already have occurred between timestamps $t$ and $t+1$.

In this paper, we focus on predictable concept drift by forecasting the future data distribution.

Streaming Data and Concept Drift

  • Streaming Data

    $$\boldsymbol{X} = \{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \cdots, \boldsymbol{x}^{(T)}\}$$

    where each element $\boldsymbol{x}^{(t)} \in \mathbb{R}^{m}$ is an $m$-dimensional vector.

    $$\boldsymbol{x}^{(t)} = [x^{(t)}_{1}, x^{(t)}_{2}, \cdots, x^{(t)}_{m}]$$

    A target sequence $\boldsymbol{y} = \{y^{(1)}, y^{(2)}, \cdots, y^{(T)}\}$ is given, corresponding to $\boldsymbol{X}$.
  • Algorithms are designed to build the model on historical data $\{\boldsymbol{x}^{(i)}, y^{(i)}\}_{i=1}^{t}$ and forecast $y$ on unseen streaming data $D^{(t)}_{test} = \{\boldsymbol{x}^{(t)}, y^{(t)}\}_{t=1}^{\mathcal{T}}$.
  • Assume $(\boldsymbol{x}^{(t)}, y^{(t)}) \sim p_t(\boldsymbol{x}, y)$, where $p_t(\boldsymbol{x}, y)$ is the data distribution at timestamp $t$. Generally, $p_t(\boldsymbol{x}, y)$ is non-stationary and keeps changing with time $t$, which is called concept drift. Formally, the concept drift between two timestamps $t$ and $t+1$ can be defined as

    $$\exists \boldsymbol{x}: p_t(\boldsymbol{x}, y) \neq p_{t+1}(\boldsymbol{x}, y)$$

  • Models are adapted to accommodate the evolving data distribution (a rolling-retraining sketch follows this list):

    $$\arg \min_{f^{(t)}, f^{(t+1)}, \dots, f^{(t+\mathcal{T})}} \sum_{i=t}^{t+\mathcal{T}} \ell(f^{(i)}(\boldsymbol{x}^{(i)}), y^{(i)})$$

    where $f^{(i)}$ is the model learned from the training data $D^{(i)}_{train} = \{\boldsymbol{x}^{(j)}, y^{(j)}\}_{j=i-k}^{i-1}$ with window size $k$, and $\ell$ is the loss function.
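A minimal sketch of this rolling objective under the window-size-$k$ setup above; the linear model and squared loss are my illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rolling_adaptation_loss(X, y, t, horizon, k):
    """Evaluates sum_{i=t}^{t+horizon} l(f_i(x_i), y_i), where each f_i is
    learned from the k samples preceding timestamp i (squared loss here)."""
    total = 0.0
    for i in range(t, t + horizon + 1):
        f_i = LinearRegression().fit(X[i - k:i], y[i - k:i])  # D_train^{(i)}
        total += float((f_i.predict(X[i:i + 1])[0] - y[i]) ** 2)
    return total
```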

The Categorization of Concept Drift

  • Measured by changing speed (toy generators for each type follow this list)
    • abrupt: the data distribution changes suddenly
    • gradual: the data distribution changes slowly
  • recurring or non-recurring
    • recurring: the data distribution changes periodically, with a fixed period
    • non-recurring: the data distribution changes without a fixed period
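Toy generators for the drift types above; the specific means, periods, and change points are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
t = np.arange(T)

# Abrupt: the mean jumps suddenly at t = 500.
abrupt = rng.normal(np.where(t < 500, 0.0, 3.0), 1.0)

# Gradual: the mean shifts slowly over the whole stream.
gradual = rng.normal(3.0 * t / T, 1.0)

# Recurring (fixed period): the mean oscillates with period 250.
recurring = rng.normal(np.sin(2 * np.pi * t / 250), 1.0)

# Non-recurring: level shifts arrive at irregular, unpredictable times.
steps = np.zeros(T)
steps[rng.choice(T, size=5, replace=False)] = rng.normal(0.0, 2.0, size=5)
non_recurring = rng.normal(np.cumsum(steps), 1.0)
```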

Previous Work

  • Use a small trigger window to detect the concept drift and adapt the model to the new data distribution after the concept drift happens.
  • Incremental adaptation.

DDG-DA

  • Predict and adapt before the concept drift happens.
  • Model-agnostic.

Method Design

Overall

Learning

  • DDG-DA (denoted as $\mathcal{M}_{\Theta}$) models the concept drift and predicts $p_{test}^{(t)}(\boldsymbol{x}, y)$.
  • $\mathcal{M}_{\Theta}$ acts like a weighted data sampler (sketched after this list): it resamples $D^{(t)}_{train}$ and generates a new training set $D^{(t)}_{resam}(\Theta) \sim p_{resam}^{(t)}(\boldsymbol{x}, y; \Theta)$.
  • Minimize the gap between $p_{test}^{(t)}(\boldsymbol{x}, y)$ and $p_{resam}^{(t)}(\boldsymbol{x}, y; \Theta)$.
  • During the training process, $\Theta$ is optimized on historical tasks so that the resampled distribution approximates the upcoming test distribution (see the objective function below).
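A toy stand-in for the weighted-sampler view of $\mathcal{M}_{\Theta}$; the function name and the way weights are supplied are my assumptions for illustration:

```python
import numpy as np

def resample_dataset(X_train, y_train, weights, size, seed=0):
    """Draw a resampled training set D_resam from D_train with per-sample
    probabilities `weights` (e.g. produced by a model of the drift)."""
    rng = np.random.default_rng(seed)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                      # normalize into a distribution
    idx = rng.choice(len(X_train), size=size, replace=True, p=p)
    return X_train[idx], y_train[idx]    # D_resam^{(t)}(Theta)
```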

Forecast

  • Given $task^{(t)} \in Task_{test}$, the forecast model is trained on $D^{(t)}_{resam}(\Theta)$ and forecasts on $D^{(t)}_{test}$.
  • $p^{(t)}_{resam}(\boldsymbol{x}, y; \Theta)$ is more similar to $p^{(t)}_{test}(\boldsymbol{x}, y)$ than $p^{(t)}_{train}(\boldsymbol{x}, y)$ is, so a model $f^{(t)}$ trained on $D^{(t)}_{resam}(\Theta)$ generalizes better to $D^{(t)}_{test}$ than one trained directly on $D^{(t)}_{train}$. A minimal forecast step is sketched below.
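A minimal forecast step for one test task; LinearRegression is a placeholder, since DDG-DA is model-agnostic:

```python
from sklearn.linear_model import LinearRegression

def forecast_task(X_resam, y_resam, X_test):
    """Train the forecast model on the resampled set D_resam, then predict
    on the unseen segment D_test. Any forecasting model can be plugged in."""
    model = LinearRegression().fit(X_resam, y_resam)
    return model.predict(X_test)
```

Equivalently, many estimators accept per-sample weights directly (e.g. `fit(X, y, sample_weight=w)` in scikit-learn), which avoids the explicit resampling step.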

Task split

Example

To handle the concept drift in data, we retrain a new model each month (the rolling time interval is 1 month) based on two years of historical data.

Each chance to retrain a new model to adapt to the concept drift is called a task. For example, the task $task^{(2011/01)}$ contains $D^{(2011/01)}_{train}$ from 2009/01 to 2010/12 and $D^{(2011/01)}_{test}$ in 2011/01.

The test sets $D^{(t)}_{test}$ of the training tasks range from 2011 to 2015, and DDG-DA is then evaluated on $Task_{test}$, which ranges from 2016 to 2020. A task-splitting sketch follows.
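A small sketch of this monthly rolling split with pandas; the dictionary layout and function name are illustrative assumptions:

```python
import pandas as pd

def build_tasks(first_test_month, last_test_month, train_years=2):
    """Enumerate monthly rolling tasks: each task tests on one month and
    trains on the preceding `train_years` years, as in the example above."""
    tasks = []
    for test_month in pd.date_range(first_test_month, last_test_month, freq="MS"):
        train_start = test_month - pd.DateOffset(years=train_years)
        train_end = test_month - pd.DateOffset(days=1)
        tasks.append({"train": (train_start, train_end), "test": test_month})
    return tasks

task_train = build_tasks("2011-01-01", "2015-12-01")   # Task_train
task_test = build_tasks("2016-01-01", "2020-12-01")    # Task_test
print(task_train[0])  # train 2009-01-01..2010-12-31, test month 2011-01
```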

Model design and learning process

  • Build a set of tasks $Task_{train} = \{task^{(1)}, task^{(2)}, \cdots, task^{(\mathcal{T})}\}$.
  • The goal of the learned DDG-DA is to improve the performance of the forecast model on $Task_{test} := \{task^{(\mathcal{T}+1)}, task^{(\mathcal{T}+2)}, \cdots, task^{(T)}\}$.

Feature Design

Objective Function

  • $\mathcal{M}_{\Theta}$ accepts the extracted features and outputs the probability distribution $p_{resam}^{(t)}(\boldsymbol{x}, y; \Theta)$.
  • To minimize the gap between $p_{resam}^{(t)}(\boldsymbol{x}, y; \Theta)$ and $p_{test}^{(t)}(\boldsymbol{x}, y)$, we use the KL divergence as the objective function.

    $$L_{\Theta}(task^{(t)}) = D_{KL}\big(p_{test}^{(t)}(\boldsymbol{x}, y) \,\|\, p_{resam}^{(t)}(\boldsymbol{x}, y; \Theta)\big)$$

    or

    $$L_{\Theta}(task^{(t)}) = \mathbb{E}_{\boldsymbol{x} \sim p_{test}^{(t)}(\boldsymbol{x})} \left[ D_{KL}\big(p_{test}^{(t)}(y|\boldsymbol{x}) \,\|\, p_{resam}^{(t)}(y|\boldsymbol{x}; \Theta)\big) \right]$$

    where $D_{KL}(\cdot \,\|\, \cdot)$ is the KL divergence between two distributions, and $\mathbb{E}_{\boldsymbol{x} \sim p_{test}^{(t)}(\boldsymbol{x})}$ is the expectation over the test feature distribution $p_{test}^{(t)}(\boldsymbol{x})$.
  • A normal distribution assumption is reasonable for unknown variables and is often used in maximum likelihood estimation, so we assume $p_{test}^{(t)}(y|\boldsymbol{x})$ and $p_{resam}^{(t)}(y|\boldsymbol{x}; \Theta)$ are normal distributions.

    $$p_{test}^{(t)}(y|\boldsymbol{x}) = \mathcal{N}(y^{(t)}_{test}(\boldsymbol{x}), \sigma)$$

    $$p_{resam}^{(t)}(y|\boldsymbol{x}; \Theta) = \mathcal{N}(y^{(t)}_{resam}(\boldsymbol{x}; \Theta), \sigma)$$

    Tips

    $y^{(t)}_{resam}(\boldsymbol{x}; \Theta)$ is the expectation of $y$ under the predicted distribution $p_{resam}^{(t)}(y|\boldsymbol{x}; \Theta)$.

  • Since both conditionals are Gaussians with the same variance, the KL divergence reduces (up to a constant factor) to the squared difference of their means, estimated empirically on $D^{(t)}_{test}$ (a short derivation sketch follows this list). We have

    $$L_{\Theta}(task^{(t)}) = \frac{1}{2} \sum_{(\boldsymbol{x}, y) \in D^{(t)}_{test}} \big\| y^{(t)}_{resam}(\boldsymbol{x}; \Theta) - y \big\|^2$$

    Summing the losses over all training tasks, we have

    $$\Theta^{\star} = \arg \min_{\Theta} \sum_{task^{(t)} \in Task_{train}} L_{\Theta}(task^{(t)})$$
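The reduction from the conditional KL objective to the squared-error loss above follows from the equal-variance Gaussian assumption; a sketch of the step:

```latex
% KL divergence between two Gaussians sharing the variance sigma^2:
D_{KL}\big(\mathcal{N}(\mu_1, \sigma) \,\|\, \mathcal{N}(\mu_2, \sigma)\big)
    = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}
% Substituting mu_1 = y_test^{(t)}(x) (observed as the label y on D_test^{(t)})
% and mu_2 = y_resam^{(t)}(x; Theta), and absorbing 1/sigma^2 as a constant:
L_{\Theta}(task^{(t)})
    \propto \frac{1}{2} \sum_{(\boldsymbol{x}, y) \in D^{(t)}_{test}}
        \big\| y^{(t)}_{resam}(\boldsymbol{x}; \Theta) - y \big\|^2
```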

Optimization

  • Build a regression proxy model $y_{proxy}^{(t)}(\boldsymbol{x}; \phi)$ to approximate $y^{(t)}_{resam}(\boldsymbol{x}; \Theta)$ (a closed-form sketch follows):

    $$\phi^{(t)} = \arg \min_{\phi} \sum_{(\boldsymbol{x}, y) \in D^{(t)}_{resam}(\Theta)} \big\| y_{proxy}^{(t)}(\boldsymbol{x}; \phi) - y \big\|^2$$
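A sketch of why a regression proxy helps: with a linear proxy, the weighted least-squares fit has a closed form, so the downstream test loss stays differentiable with respect to the sample weights that $\mathcal{M}_{\Theta}$ produces. This is my reading of the trick; the exact weighting and features in the paper may differ:

```python
import numpy as np

def fit_linear_proxy(X, y, w, ridge=1e-3):
    """Closed-form weighted ridge regression:
    phi = (X^T W X + lambda I)^{-1} X^T W y, with W = diag(w)."""
    Xw = X * w[:, None]                      # equivalent to W @ X
    A = X.T @ Xw + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, Xw.T @ y)

def proxy_test_loss(phi, X_test, y_test):
    """Squared test loss of the proxy, i.e. L_Theta(task) up to constants."""
    return 0.5 * float(np.sum((X_test @ phi - y_test) ** 2))
```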

Code
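To my knowledge, the official implementation ships with Microsoft's Qlib (in recent versions under `examples/benchmarks_dynamic/DDG-DA`). Below is a self-contained sketch of the learning loop in the spirit of the method: $\Theta$ holds one logit per historical segment, a closed-form weighted ridge proxy keeps the test loss differentiable in $\Theta$, and Adam optimizes over all training tasks. Each element of `tasks` is assumed to be a tuple `(X_tr, y_tr, seg_ids, X_te, y_te)` of tensors for one historical task; the segment granularity, softmax weighting, and hyperparameters are all illustrative assumptions.

```python
import torch

def train_ddgda_style(tasks, n_segments, dim, steps=200, lr=0.01, ridge=1e-3):
    """Learn Theta (one logit per historical time segment) by minimizing the
    squared test loss of a closed-form linear proxy fit on Theta-weighted data."""
    theta = torch.zeros(n_segments, requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    eye = ridge * torch.eye(dim)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.tensor(0.0)
        for X_tr, y_tr, seg_ids, X_te, y_te in tasks:
            # Per-sample weights from segment-level logits; softmax keeps them positive.
            w = torch.softmax(theta, dim=0)[seg_ids]
            # Closed-form weighted ridge proxy: phi = (X^T W X + lam I)^{-1} X^T W y.
            Xw = X_tr * w.unsqueeze(1)
            phi = torch.linalg.solve(X_tr.T @ Xw + eye, Xw.T @ y_tr)
            # L_Theta(task): squared error of the proxy on the task's test segment.
            loss = loss + 0.5 * torch.sum((X_te @ phi - y_te) ** 2)
        loss.backward()
        opt.step()
    return theta.detach()
```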

References

  • Li, W., Yang, X., Liu, W., Xia, Y., and Bian, J. 2022. DDG-DA: Data Distribution Generation for Predictable Concept Drift Adaptation. In AAAI 2022.
  • Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., and Zhang, G. 2018. Learning under Concept Drift: A Review. IEEE Transactions on Knowledge and Data Engineering.