When training an ML or DL model, it is crucial to have a separate validation data set for tuning hyperparameters and preventing overfitting. The validation data set should come from the original training data, which is divided into a new, smaller training data set and a validation data set.
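As a minimal sketch of this split, assuming scikit-learn and NumPy arrays standing in for the real training data (all variable names here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data in place of the original training set.
X_train_full = np.random.rand(1000, 20)
y_train_full = np.random.randint(0, 2, 1000)

# Hold out 20% of the training data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, random_state=42
)
```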
A validation data set is a data set of examples used to tune the hyperparameters (i.e. the architecture) of a classifier. It is sometimes also called the development set or the "dev set"... In order to get more stable results and use all valuable data for training, a data set can be repeatedly split into several training and validation data sets. This is known as cross-validation. To confirm the model's performance, an additional test data set held out from cross-validation is normally used.
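A minimal sketch of k-fold cross-validation, again assuming scikit-learn; the choice of logistic regression is illustrative only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data in place of the original training set.
X_train_full = np.random.rand(1000, 20)
y_train_full = np.random.randint(0, 2, 1000)

model = LogisticRegression(max_iter=1000)

# Each of the 5 folds serves once as the validation set while the
# remaining 4 folds are used for training.
scores = cross_val_score(model, X_train_full, y_train_full, cv=5)
print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```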
The original test data set is used to evaluate the final model's performance on unseen data. It is kept separate and used only for the final evaluation, after all training and validation steps are completed. Using part of the test set for validation would mean the test set is no longer truly unseen, leading to potential data leakage and optimistically biased performance metrics.
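Putting the pieces together, a minimal sketch of the full workflow: the test set is split off first and never touched during tuning, cross-validation runs on the training portion only, and the test set is evaluated exactly once at the end (all names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder data in place of a real data set.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, 1000)

# Split off the test set first; it plays no role in tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)  # tuning signal

model.fit(X_train, y_train)             # retrain on all training data
test_acc = model.score(X_test, y_test)  # single final evaluation
print(f"CV accuracy: {cv_scores.mean():.3f}, test accuracy: {test_acc:.3f}")
```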