
We know that in machine learning, the dataset is typically divided into three parts: training data, validation data, and test data.

On the other hand, K-fold cross-validation is defined as follows:

the dataset is divided into $K$ different sections (folds): one fold is used for testing and the rest for training. This is repeated $K$ times, each time holding out a different fold, and the results of these $K$ tests are then averaged to get the final accuracy.

What happens to the validation dataset in K-fold cross-validation? Is there such a dataset?

nbro
DSPinfinity

1 Answer


Before doing $k$-fold cross-validation, you divide your whole dataset into train and test splits (e.g., 80-20 in proportion); then the $k$-fold cross-validation is performed on the training split as follows:

The training data is split into $k$ folds, and at each training iteration one fold is held out for validation, so $k-1$ folds are effectively used for training. The process repeats $k$ times, each time using a different fold for validation.
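
For concreteness, here is a minimal sketch of this procedure with scikit-learn. The dataset, model, and choice of $k=5$ are placeholder assumptions for illustration, not prescribed by the procedure itself:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold

X, y = load_iris(return_X_y=True)

# First split off a held-out test set (e.g., 80-20 in proportion).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then run k-fold cross-validation on the training split only.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X_train):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])
    # The held-out fold plays the role of the validation set.
    scores.append(model.score(X_train[val_idx], y_train[val_idx]))

print(sum(scores) / len(scores))  # average validation accuracy
```

So the answer to the question is: yes, there is still a validation set, but it is not a single fixed split. Each fold takes a turn as the validation set.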

The $k$-fold cross-validation approach is useful when your dataset is small. It also provides an easy way to obtain $k$ (slightly different) models to build an ensemble, since each iteration trains a new model on a different split of the training data, and such an ensemble is expected to generalize better. Moreover, since you have $k$ models, you can use the variance of their predictions as a way to estimate uncertainty.
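
A rough sketch of that last idea, using the $k$ per-fold models as an ensemble and reading their disagreement as an uncertainty estimate (same placeholder data and model as the sketch above):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train one model per fold; each sees a different (k-1)/k of the data.
models = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, _ in kf.split(X_train):
    m = LogisticRegression(max_iter=1000)
    m.fit(X_train[train_idx], y_train[train_idx])
    models.append(m)

# Ensemble prediction: average the per-model class probabilities.
probs = np.stack([m.predict_proba(X_test) for m in models])  # (k, n, classes)
ensemble_pred = probs.mean(axis=0).argmax(axis=1)

# Uncertainty estimate: variance of predicted probabilities across the k models.
uncertainty = probs.var(axis=0).mean(axis=1)
```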

Luca Anzalone