ExPretio offers artificial intelligence solutions for passenger transport operators. The collaboration was bridged through a Mitacs Accelerate project by IVADO.
Project description
Demand forecasting models in current practice are typically based on historical sales data from recent years. These so-called long-term models are relatively precise because the context often remains the same over time; the global, long-term consistency within the data (such as periodicity and seasonality) is sufficient to make a good prediction. However, unforeseen events such as a pandemic, an attack, or a line closing/opening can occur and radically change purchasing/booking behavior for train services. In such cases, the future is more likely to be dominated by the special event and by short-term patterns (e.g., autoregressive dynamics). Therefore, models relying on global patterns alone often fail to re-adjust quickly enough to provide a satisfactory forecast.
The goal of this project is to develop new models for forecasting train passenger demand under special events and unseen scenarios. We will focus on developing models that can efficiently and effectively learn and reproduce the generative mechanism of time series data under special events and unforeseen circumstances. An important component is to characterize both the long-term patterns (e.g., periodicity and seasonality) and the short-term dynamics (e.g., due to the pandemic). The short-term component enables the model to quickly adjust and pick up new trends linked to these unforeseen events; it would be based, for example, only on the last fifteen days of sales, to provide a more suitable forecast. The model would then be combined again with the long-term models when everything returns to normal.
In this project, the intern will apply and develop models and algorithms to address the challenges in demand forecasting. In particular, we intend to use a low-rank structure to capture the long-term consistency in the data and autoregressive dynamics and time-varying parameters to quickly update the model to account for the short-term trends. The final model will be tested on the booking data of Keolis train services.
0. About this notebook
This notebook demonstrates how we process and analyze Case B and Case C using the Vector AutoRegression (VAR) model.
Running environment: Python 3.7.4, 64-bit
1. Data processing
Since we do not need to look into the days-before-departure level under Case B and Case C, the resolution of the booking data can be flattened into $(OD \times Products) \times DepartureDates$ for Case B and into $(OD \times DepartureTime) \times DepartureDates$ for Case C. This can be regarded as a multivariate time series forecasting problem.
This section shows how we transform the original input into the multivariate time series we want.
1.1. Pre-process the input data
Load the original data and pre-process it. We number the different $OD$ pairs and $Products$ with a categorical index; see utils.py for details.
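As a hedged illustration only, the pre-processing might look like the following sketch, where the file name and the columns `od`, `product`, `departure_date`, and `num_bookings` are assumptions rather than the real schema (the actual logic lives in utils.py):

```python
# A minimal pre-processing sketch; file name and column names are assumed,
# the actual loading and numbering logic lives in utils.py.
import pandas as pd

bookings = pd.read_csv("bookings.csv", parse_dates=["departure_date"])

# Number the different OD pairs and products with categorical codes.
bookings["od_id"] = bookings["od"].astype("category").cat.codes
bookings["product_id"] = bookings["product"].astype("category").cat.codes
```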
Similarly, we apply the same processing to the Case B and Case C benchmarks, i.e., the test sets.
1.2. Construct Case B multi-variate time series
As illustrated above, at the departure-date level of aggregation, we can construct a multivariate time series with dimension $(OD \times Products) \times DepartureDates$. It is a matrix of dimension $K \times T$, where $K$ is the number of distinct combinations of OD and product, and $T$ is the window length of the time series that we feed the model. In Case B, 3 different OD pairs (A-C, A-B, B-C) and 5 pricing products mean that $K = 3 \times 5 = 15$, and $T$ can be fixed manually so the model can self-update when new data come in.
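As a sketch of this construction, reusing the hypothetical `bookings` frame from Section 1.1 (so the column names remain assumptions), the pivot to a $K \times T$ matrix could look like:

```python
# Case B pivot sketch: rows are (OD, product) pairs, columns are departure dates.
case_b = (bookings
          .groupby(["od", "product", "departure_date"])["num_bookings"].sum()
          .unstack("departure_date")
          .fillna(0))
# case_b.shape == (K, T) with K = 3 ODs x 5 products = 15.
```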
1.3. Construct Case C multi-variate time series
Case C is also based on the departure-date level of aggregation. Moreover, trains depart on an hourly basis. Thus, similarly to Case B, we can construct a multivariate time series with dimension $(OD \times DepartureTime) \times DepartureDates$. It is also a matrix of dimension $K \times T$, where $K$ is the number of distinct combinations of OD and departure hour of the day, and $T$ is the window length of the time series that we feed the model.
It is important to first determine the range of departure hours.
Therefore, in Case C, 3 different OD pairs (A-C, A-B, B-C) and 18 departure hours mean that $K = 3 \times 18 = 54$, and $T$ can be fixed manually so the model can self-update when new data come in.
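An analogous sketch for Case C, again on the hypothetical `bookings` frame, assuming a datetime column `departure_time` and an 18-hour operating range (05:00-22:00 here, purely as an assumption):

```python
# Case C pivot sketch: rows are (OD, departure hour) pairs, columns are dates.
hours = pd.to_datetime(bookings["departure_time"]).dt.hour
case_c = (bookings[hours.between(5, 22)]     # assumed 18 operating hours
          .assign(hour=hours)
          .groupby(["od", "hour", "departure_date"])["num_bookings"].sum()
          .unstack("departure_date")
          .fillna(0))
# case_c.shape == (K, T) with K = 3 ODs x 18 departure hours = 54.
```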
2. Implement VAR model on the processed multi-variate time series
Following this blog, we would first like to introduce VAR briefly.
VAR is a multivariate forecasting algorithm used when two or more time series influence each other. It is called "autoregressive" because it is an autoregressive model in which each variable (i.e., each time series) is modeled as a function of past values; that is, the predictors are nothing but the lags (time-delayed values) of the series. Compared with other autoregressive models such as AR, ARMA, or ARIMA, the primary difference is that those models are uni-directional: the predictors influence $Y$, but not vice versa. Vector AutoRegression (VAR), in contrast, is bi-directional: the variables influence each other.
In the original AR models, a time series is modeled as a linear combination of its own lags; that is, the past values of the time series are used to forecast the current and future values. A typical $AR(p)$ model with lag order $p$ for a time series $\{Y_1, Y_2, \dots, Y_n\}$ takes the form:

$$Y_t = \alpha + \beta_1 Y_{t-1} + \beta_2 Y_{t-2} + \dots + \beta_p Y_{t-p} + \epsilon_t,$$
where $\alpha$ is the intercept (a constant) and $\beta_1, \beta_2, \dots, \beta_p$ are the coefficients of the lags of $Y$ up to order $p$. To emphasize: order $p$ means that up to $p$ lags of $Y$ are used as predictors in the equation. The term $\epsilon_t$ is the error, considered white noise.
Now, in the VAR model, each variable is modeled as a linear combination of its own past values and the past values of the other variables in the system. Since we have multiple time series with dimension $K \times T$ for Case B and Case C, defined as $\{Y_1^{(1)},Y_2^{(1)},\dots,Y_T^{(1)}\}, \{Y_1^{(2)},Y_2^{(2)},\dots,Y_T^{(2)}\}, \dots, \{Y_1^{(K)},Y_2^{(K)},\dots,Y_T^{(K)}\}$, we assume that each of the $K$ variables correlates with the others. Thus, to compute the value of $Y_t^{(1)}$, VAR uses the past values of $Y^{(1)}$ as well as those of the other $K-1$ time series. Take the 1-lag model $VAR(1)$ as an example:

$$Y_t^{(1)} = \alpha_1 + \beta_{11} Y_{t-1}^{(1)} + \beta_{12} Y_{t-1}^{(2)} + \dots + \beta_{1K} Y_{t-1}^{(K)} + \epsilon_t^{(1)}.$$
The remaining variables $Y_t^{(2)}, Y_t^{(3)}, \dots, Y_t^{(K)}$ are updated in the same way.
2.1. VAR results for Case B
In this section, we use all the data before the test set as training data and forecast the period from 2020-01-18 to 2020-01-31, 14 days in total.
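A minimal sketch of this step with statsmodels is shown below; since the real booking data cannot be reproduced here, `ts_b` is a synthetic stand-in for the Case B matrix transposed to $T \times K$ (in practice it would be `case_b.T` from Section 1.2), and the lag order is purely illustrative:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Synthetic stand-in for the (transposed) Case B matrix: one column per
# OD/product combination, one row per departure date.
rng = np.random.default_rng(0)
dates = pd.date_range("2018-02-01", "2020-01-31", freq="D")
ts_b = pd.DataFrame(rng.poisson(20.0, size=(len(dates), 15)).astype(float),
                    index=dates,
                    columns=[f"od{i}_prod{j}" for i in range(3) for j in range(5)])

train = ts_b.loc[:"2020-01-17"]                  # all data before the test set
fitted = VAR(train).fit(maxlags=7)               # illustrative lag order
fc = fitted.forecast(train.values[-fitted.k_ar:], steps=14)
fc = pd.DataFrame(fc, columns=train.columns,
                  index=pd.date_range("2020-01-18", periods=14, freq="D"))
```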
The premise behind Vector AutoRegression is that each of the time series in the system influences the others; that is, you can predict a series using past values of itself along with past values of the other series in the system. Using Granger's causality test, it is possible to test this relationship even before building the model. Here we test causality after building the model, as a descriptive analysis.
Granger's causality test checks the null hypothesis that the coefficients of the past values of one series in the regression equation of another are zero. In simpler terms, if the p-value of the test is less than the significance level of 0.05, you can safely reject the null hypothesis and conclude that the past values of that series help predict the other.
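A sketch of how such a pairwise test grid could be computed with statsmodels, reusing the synthetic `ts_b` from above; entry (i, j) holds the minimum p-value over the tested lags for "column j Granger-causes column i":

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

maxlag = 7
cols = ts_b.columns
pvals = pd.DataFrame(np.ones((len(cols), len(cols))), index=cols, columns=cols)
for caused in cols:
    for causing in cols:
        # grangercausalitytests checks whether the 2nd column causes the 1st.
        res = grangercausalitytests(ts_b[[caused, causing]],
                                    maxlag=maxlag, verbose=False)
        pvals.loc[caused, causing] = min(res[lag][0]["ssr_chi2test"][1]
                                         for lag in range(1, maxlag + 1))
```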
Causality can generally be read off from the green areas, where the null hypothesis is rejected, indicating the existence of causality. The low-priced products show relatively strong causality in the prediction results. Some variables might not even Granger-cause themselves, like B-C_115, which would be interesting to look into.
You might find that the forecast trends are flatter than expected. We need to examine four aspects to validate the superiority of vector autoregression over fitting autoregressions individually:
Does the summation over each OD/product really affect the demand forecast, or is the "product" feature really important?
What if we forecast only one step ahead instead of doing it multi-step?
What if we forecast one feature at a time; would that be more accurate? The causality graph suggests the features do not influence each other closely.
Can seasonality (adding more time lags together) improve the prediction?
For the first question, we analyze the summation.
This problem has been solved: we get better output if we choose a larger lag order.
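As a sketch of how the lag order could be chosen systematically, statsmodels reports AIC/BIC/FPE/HQIC per candidate lag (again on the synthetic `ts_b` from Section 2.1):

```python
from statsmodels.tsa.api import VAR

model = VAR(ts_b.loc[:"2020-01-17"])
print(model.select_order(maxlags=15).summary())   # criteria per candidate lag
fitted = model.fit(maxlags=15, ic="aic")          # pick the lag minimizing AIC
```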
2.2. VAR results for Case C
In this section, we use all the data before the test set as training data and forecast the period from 2020-01-18 to 2020-01-31, 14 days in total.
3. Online (new-data) updating version
Previous implementations use all the historical data. However, information from the very beginning, say two years ago, might not be useful for forecasting the bookings in the next week. Moreover, we need the algorithm to adapt to incoming new data. We can achieve this by setting a fixed window length for the input. For example, since we want to forecast the bookings from 2020-01-18 to 2020-01-31, we can use only the past 2-3 months of data as input. Whenever new data come in, we can retrain the model to update the parameters of the VAR model.
3.1. Case B Online version
A window length of 100 days proves to be a proper choice for the input.
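A hypothetical sweep to validate such a choice could look like the following, reusing the synthetic `ts_b` from Section 2.1 (the candidate window lengths are assumptions, and the lag order is kept small so the shortest window remains estimable):

```python
import numpy as np
from statsmodels.tsa.api import VAR

test = ts_b.loc["2020-01-18":"2020-01-31"]
for window in (40, 70, 100, 150):
    train = ts_b.loc[:"2020-01-17"].iloc[-window:]   # trailing window only
    fitted = VAR(train).fit(maxlags=2)
    fc = fitted.forecast(train.values[-fitted.k_ar:], steps=len(test))
    print(window, float(np.mean(np.abs(test.values - fc))))
```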
Online updating is introduced to adapt to newly incoming data. Assume we are currently on 2020-01-17. We then move our 100-day sliding window forward as if new data arrived day by day from 2020-01-18 to 2020-01-31, to simulate the real-time updating process. Note that this is not the same as using the validation-set data for forecasting! This section illustrates how to make the algorithm online.
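A sketch of this day-by-day simulation, again on the synthetic `ts_b` (window length and lag order are illustrative): each "new" day, we refit on the trailing 100-day window and forecast one step ahead.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

window, preds, maes = 100, {}, {}
for day in pd.date_range("2020-01-18", "2020-01-31", freq="D"):
    # Trailing 100-day window ending the day before the forecast target.
    train = ts_b.loc[:day - pd.Timedelta(days=1)].iloc[-window:]
    fitted = VAR(train).fit(maxlags=3)            # modest lag for a 100-day window
    yhat = fitted.forecast(train.values[-fitted.k_ar:], steps=1)[0]
    preds[day] = yhat
    maes[day] = float(np.mean(np.abs(ts_b.loc[day].values - yhat)))
```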
It can be seen that the online version maintains a relatively stable MAE. Thus, when new data come in, the algorithm will still be efficient.
3.2. Case C online version
A window length of 40 days seems to be a good choice.
Slide the window
The same conclusion as in Case B can be drawn for Case C: the online version does not undermine performance.