Some of you may be familiar with the unusual split scheme used for time-series data. In short, the conventional advice is to only consider splits where the training set comes before the test set (in index or datetime order); otherwise we would effectively be using future data to infer the past.
Formally, given the dataset $\mathcal{D}=\{(x_1,y_1),\dots,(x_n,y_n)\}$ (ordered by time), a valid split may look like \begin{align} train&=\{(x_1,y_1),\dots,(x_j,y_j)\}\subseteq\mathcal{D}\\ test&=\{(x_{j+1},y_{j+1}),\dots,(x_k,y_k)\}\subseteq\mathcal{D} \end{align} for some $1\leq j<k\leq n$.
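To make the setup concrete, here is a minimal sketch of such a chronological split in Python (the function name `chronological_split` and the `train_frac` parameter are my own illustration, not a standard API; `scikit-learn`'s `TimeSeriesSplit` implements the same idea for cross-validation):

```python
import numpy as np

def chronological_split(X, y, train_frac=0.8):
    """Split (X, y) so that all training rows precede all test rows in time.

    Assumes the rows are already sorted by their time index.
    `train_frac` chooses the cut point j = floor(n * train_frac).
    """
    n = len(X)
    j = int(n * train_frac)
    return (X[:j], y[:j]), (X[j:], y[j:])

# Toy example: 10 observations ordered by time
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)
(X_train, y_train), (X_test, y_test) = chronological_split(X, y, train_frac=0.7)

# Every training index precedes every test index
assert y_train.max() < y_test.min()
print(len(y_train), len(y_test))
```

Contrast this with a random split, where future observations can land in the training set while earlier ones land in the test set.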
My question is: are there cases where random splitting is OK for time series? And what exactly is the main problem with random sampling?