ExPretio offers artificial intelligence solutions for passenger transport operators. The collaboration was bridged through a Mitacs Accelerate project by IVADO.
Project description
Demand forecasting models in current practice are typically based on historical sales data from recent years. These so-called long-term models are relatively precise because the context often remains the same over time; the global, long-term consistency within the data (such as periodicity and seasonality) is sufficient to make a good prediction. However, unforeseen events such as a pandemic, an attack, or a line closing/opening can occur and radically change purchasing/booking behavior for train services. In such cases, the future is more likely to be dominated by the special event and by short-term patterns (e.g., autoregressive dynamics). Therefore, models relying on global patterns alone often fail to re-adjust quickly enough to provide a satisfactory forecast.
The goal of this project is to develop new models for forecasting train passenger demand under special events and unseen scenarios. We will focus on developing models that can efficiently and effectively learn and reproduce the generative mechanism of time series data under special events and unforeseen circumstances. An important component is to characterize both the long-term patterns (e.g., periodicity and seasonality) and the short-term dynamics (e.g., due to the pandemic). The short-term component enables the model to quickly adjust and pick up new trends linked to these unforeseen events; it would be based, for example, only on the last fifteen days of sales, to provide a more suitable forecast. The model would then be combined again with the long-term models when everything returns to normal.
In this project, the intern will apply and develop models and algorithms to address the challenges in demand forecasting. In particular, we intend to use a low-rank structure to capture the long-term consistency in the data and autoregressive dynamics and time-varying parameters to quickly update the model to account for the short-term trends. The final model will be tested on the booking data of Keolis train services.
0. About this notebook
This notebook demonstrates how we process and analyze Case B and Case C using the Vector AutoRegression (VAR) model.
Running environment: Python 3.7.4, 64-bit
1. Data processing
Since we do not need to look into the days-before-departure level under Case B and Case C, the resolution of the booking data can be flattened into $(OD \times Products) \times DepartureDates$ for Case B and into $(OD \times DepartureTime) \times DepartureDates$ for Case C. This can be regarded as a multivariate time series forecasting problem.
This section shows how we transform the original input into the multivariate time series we want.
1.1. Pre-process the input data
Load the original data and pre-process it. We number the different $OD$ pairs and $Products$ with a categorical index; see utils.py for details.
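As a hedged illustration only, the pre-processing might look like the following sketch, where the file name and the columns `od`, `product`, `departure_date`, and `num_bookings` are assumptions rather than the real schema (the actual logic lives in utils.py):

```python
# A minimal pre-processing sketch; file name and column names are assumed,
# the actual loading and numbering logic lives in utils.py.
import pandas as pd

bookings = pd.read_csv("bookings.csv", parse_dates=["departure_date"])

# Number the different OD pairs and products with categorical codes.
bookings["od_id"] = bookings["od"].astype("category").cat.codes
bookings["product_id"] = bookings["product"].astype("category").cat.codes
```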
Similarly, we apply the same processing to the Case B and Case C benchmarks, i.e., the test sets.
1.2. Construct Case B multi-variate time series
As illustrated above, at the departure-date level of aggregation, we can construct a multivariate time series with dimension $(OD \times Products) \times DepartureDates$. It is a matrix of dimension $K \times T$, where $K$ is the number of distinct combinations of OD and product, and $T$ is the window length of the time series that we feed the model. In Case B, 3 different OD pairs (A-C, A-B, B-C) and 5 pricing products mean that $K = 3 \times 5 = 15$, and $T$ can be fixed manually so the model can self-update when new data come in.
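As a sketch of this construction, reusing the hypothetical `bookings` frame from Section 1.1 (so the column names remain assumptions), the pivot to a $K \times T$ matrix could look like:

```python
# Case B pivot sketch: rows are (OD, product) pairs, columns are departure dates.
case_b = (bookings
          .groupby(["od", "product", "departure_date"])["num_bookings"].sum()
          .unstack("departure_date")
          .fillna(0))
# case_b.shape == (K, T) with K = 3 ODs x 5 products = 15.
```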
1.3. Construct Case C multi-variate time series
Case C is also based on the departure-date level of aggregation. Moreover, trains depart on an hourly basis. Thus, similarly to Case B, we can construct a multivariate time series with dimension $(OD \times DepartureTime) \times DepartureDates$. It is also a matrix of dimension $K \times T$, where $K$ is the number of distinct combinations of OD and departure hour of the day, and $T$ is the window length of the time series that we feed the model.
It is important to first determine the range of departure hours.
Therefore, in Case C, 3 different OD pairs (A-C, A-B, B-C) and 18 departure hours mean that $K = 3 \times 18 = 54$, and $T$ can be fixed manually so the model can self-update when new data come in.
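An analogous sketch for Case C, again on the hypothetical `bookings` frame, assuming a datetime column `departure_time` and an 18-hour operating range (05:00-22:00 here, purely as an assumption):

```python
# Case C pivot sketch: rows are (OD, departure hour) pairs, columns are dates.
hours = pd.to_datetime(bookings["departure_time"]).dt.hour
case_c = (bookings[hours.between(5, 22)]     # assumed 18 operating hours
          .assign(hour=hours)
          .groupby(["od", "hour", "departure_date"])["num_bookings"].sum()
          .unstack("departure_date")
          .fillna(0))
# case_c.shape == (K, T) with K = 3 ODs x 18 departure hours = 54.
```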
2. Implement VAR model on the processed multi-variate time series
Following this blog, we would first like to introduce VAR briefly.
VAR is a multivariate forecasting algorithm used when two or more time series influence each other. It is called "autoregressive" because it is an autoregressive model in which each variable (i.e., each time series) is modeled as a function of past values; that is, the predictors are nothing but the lags (time-delayed values) of the series. Compared with other autoregressive models such as AR, ARMA, or ARIMA, the primary difference is that those models are uni-directional: the predictors influence $Y$, but not vice versa. Vector AutoRegression (VAR), in contrast, is bi-directional: the variables influence each other.
In the original AR models, a time series is modeled as a linear combination of its own lags; that is, the past values of the time series are used to forecast the current and future values. A typical $AR(p)$ model with lag order $p$ for a time series $\{Y_1, Y_2, \dots, Y_n\}$ takes the form:

$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + \epsilon_t,$$
where $\alpha$ is the intercept (a constant) and $\beta_1, \beta_2, \dots, \beta_p$ are the coefficients of the lags of $Y$ up to order $p$. To emphasize: order $p$ means that up to $p$ lags of $Y$ are used as predictors in the equation. The term $\epsilon_t$ is the error, considered white noise.
Now, in the VAR model, each variable is modeled as a linear combination of its own past values and the past values of the other variables in the system. Since we have multiple time series with dimension $K \times T$ for Case B and Case C, defined as $\{Y_1^{(1)},Y_2^{(1)},\dots,Y_T^{(1)}\}, \{Y_1^{(2)},Y_2^{(2)},\dots,Y_T^{(2)}\}, \dots, \{Y_1^{(K)},Y_2^{(K)},\dots,Y_T^{(K)}\}$, we assume that each of the $K$ variables correlates with the others. Thus, to compute the value of $Y_t^{(1)}$, VAR uses the past values of $Y^{(1)}$ as well as those of the other $K-1$ time series. Take the 1-lag model $VAR(1)$ as an example:

$$Y_t^{(1)} = \alpha_1 + \beta_{11} Y_{t-1}^{(1)} + \beta_{12} Y_{t-1}^{(2)} + \dots + \beta_{1K} Y_{t-1}^{(K)} + \epsilon_t^{(1)}.$$
The remaining variables $Y_t^{(2)}, Y_t^{(3)}, \dots, Y_t^{(K)}$ are updated in the same way.
2.1. VAR results for Case B
In this section, we use all the data before the test set as training data and forecast the period from 2020-01-18 to 2020-01-31, 14 days in total.
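A minimal sketch of this step with statsmodels is shown below; since the real booking data cannot be reproduced here, `ts_b` is a synthetic stand-in for the Case B matrix transposed to $T \times K$ (in practice it would be `case_b.T` from Section 1.2), and the lag order is purely illustrative:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Synthetic stand-in for the (transposed) Case B matrix: one column per
# OD/product combination, one row per departure date.
rng = np.random.default_rng(0)
dates = pd.date_range("2018-02-01", "2020-01-31", freq="D")
ts_b = pd.DataFrame(rng.poisson(20.0, size=(len(dates), 15)).astype(float),
                    index=dates,
                    columns=[f"od{i}_prod{j}" for i in range(3) for j in range(5)])

train = ts_b.loc[:"2020-01-17"]                  # all data before the test set
fitted = VAR(train).fit(maxlags=7)               # illustrative lag order
fc = fitted.forecast(train.values[-fitted.k_ar:], steps=14)
fc = pd.DataFrame(fc, columns=train.columns,
                  index=pd.date_range("2020-01-18", periods=14, freq="D"))
```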
The premise behind Vector AutoRegression is that each of the time series in the system influences the others; that is, you can predict a series using past values of itself along with past values of the other series in the system. Using Granger's causality test, it is possible to test this relationship even before building the model. Here we test causality after building the model, as a descriptive analysis.
Granger's causality test checks the null hypothesis that the coefficients of the past values of one series in the regression equation of another are zero. In simpler terms, if the p-value of the test is less than the significance level of 0.05, you can safely reject the null hypothesis and conclude that the past values of that series help predict the other.
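A sketch of how such a pairwise test grid could be computed with statsmodels, reusing the synthetic `ts_b` from above; entry (i, j) holds the minimum p-value over the tested lags for "column j Granger-causes column i":

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

maxlag = 7
cols = ts_b.columns
pvals = pd.DataFrame(np.ones((len(cols), len(cols))), index=cols, columns=cols)
for caused in cols:
    for causing in cols:
        # grangercausalitytests checks whether the 2nd column causes the 1st.
        res = grangercausalitytests(ts_b[[caused, causing]],
                                    maxlag=maxlag, verbose=False)
        pvals.loc[caused, causing] = min(res[lag][0]["ssr_chi2test"][1]
                                         for lag in range(1, maxlag + 1))
```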
Causality can generally be read off from the green areas, where the null hypothesis is rejected, indicating the existence of causality. The low-priced products show relatively strong causality in the prediction results. Some variables might not even Granger-cause themselves, like B-C_115, which would be interesting to look into.
You might find that the forecast trends are flatter than expected. We need to examine four aspects to validate the superiority of vector autoregression over fitting autoregressions individually:
Does the summation over each OD/product really affect the demand forecast, or is the "product" feature really important?
What if we forecast only one step ahead instead of doing it multi-step?
What if we forecast one feature at a time; would that be more accurate? The causality graph suggests the features do not influence each other closely.
Can seasonality (adding more time lags together) improve the prediction?
For the first question, we analyze the summation.
This problem has been solved: we get better output if we choose a larger lag order.
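As a sketch of how the lag order could be chosen systematically, statsmodels reports AIC/BIC/FPE/HQIC per candidate lag (again on the synthetic `ts_b` from Section 2.1):

```python
from statsmodels.tsa.api import VAR

model = VAR(ts_b.loc[:"2020-01-17"])
print(model.select_order(maxlags=15).summary())   # criteria per candidate lag
fitted = model.fit(maxlags=15, ic="aic")          # pick the lag minimizing AIC
```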
2.2. VAR results for Case C
In this section, we use all the data before the test set as training data and forecast the period from 2020-01-18 to 2020-01-31, 14 days in total.
3. Online (new-data) updating version
Previous implementations use all the historical data. However, information from the very beginning, say two years ago, might not be useful for forecasting the bookings in the next week. Moreover, we need the algorithm to adapt to incoming new data. We can achieve this by setting a fixed window length for the input. For example, since we want to forecast the bookings from 2020-01-18 to 2020-01-31, we can use only the past 2-3 months of data as input. Whenever new data come in, we can retrain the model to update the parameters of the VAR model.
3.1. Case B Online version
A window length of 100 days proves to be a proper choice for the input.
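A hypothetical sweep to validate such a choice could look like the following, reusing the synthetic `ts_b` from Section 2.1 (the candidate window lengths are assumptions, and the lag order is kept small so the shortest window remains estimable):

```python
import numpy as np
from statsmodels.tsa.api import VAR

test = ts_b.loc["2020-01-18":"2020-01-31"]
for window in (40, 70, 100, 150):
    train = ts_b.loc[:"2020-01-17"].iloc[-window:]   # trailing window only
    fitted = VAR(train).fit(maxlags=2)
    fc = fitted.forecast(train.values[-fitted.k_ar:], steps=len(test))
    print(window, float(np.mean(np.abs(test.values - fc))))
```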
Online updating is introduced to adapt to newly incoming data. Assume we are currently on 2020-01-17. We then move our 100-day sliding window forward as if new data arrived day by day from 2020-01-18 to 2020-01-31, to simulate the real-time updating process. Note that this is not the same as using the validation-set data for forecasting! This section illustrates how to make the algorithm online.
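A sketch of this day-by-day simulation, again on the synthetic `ts_b` (window length and lag order are illustrative): each "new" day, we refit on the trailing 100-day window and forecast one step ahead.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

window, preds, maes = 100, {}, {}
for day in pd.date_range("2020-01-18", "2020-01-31", freq="D"):
    # Trailing 100-day window ending the day before the forecast target.
    train = ts_b.loc[:day - pd.Timedelta(days=1)].iloc[-window:]
    fitted = VAR(train).fit(maxlags=3)            # modest lag for a 100-day window
    yhat = fitted.forecast(train.values[-fitted.k_ar:], steps=1)[0]
    preds[day] = yhat
    maes[day] = float(np.mean(np.abs(ts_b.loc[day].values - yhat)))
```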
It can be seen that the online version maintains a relatively stable MAE. Thus, when new data come in, the algorithm will still be efficient.
3.2. Case C online version
A window length of 40 days seems to be a good choice.
Slide the window
The same conclusion as in Case B can be drawn for Case C: the online version does not undermine performance.