I've been training several types of MLPs with different optimisers and tuning them with Keras's Hyperband tuner. All of them follow this cone architecture:
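Roughly, the setup looks like this (an illustrative sketch, not my exact code; the input size, width/depth ranges and max_epochs are placeholders):

```python
# Illustrative sketch of the setup: a "cone" MLP whose layer widths
# shrink towards the output, tuned with keras_tuner's Hyperband.
# Input size and hyperparameter ranges are placeholders, not my real ones.
import keras_tuner as kt
from tensorflow import keras

def build_model(hp):
    model = keras.Sequential()
    model.add(keras.Input(shape=(16,)))  # placeholder input dimension
    units = hp.Int("width", min_value=64, max_value=512, step=64)
    for _ in range(hp.Int("depth", min_value=2, max_value=5)):
        model.add(keras.layers.Dense(units, activation="relu"))
        units = max(units // 2, 8)  # halve the width each layer -> cone shape
    model.add(keras.layers.Dense(1))
    model.compile(optimizer="adam", loss="mse")
    return model

tuner = kt.Hyperband(build_model, objective="val_loss", max_epochs=30)
# tuner.search(x_train, y_train, validation_data=(x_val, y_val))
```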

All the networks were trained on the same dataset: 342k examples for training and a separate 62k examples for validation (approx. a 20% split), with no examples shared between the two. Both sets should come from the same distribution; a two-sample Kolmogorov-Smirnov test between them gives:
KstestResult(statistic=0.06393057115770212, pvalue=1.8775454622663613e-165, statistic_location=-0.045045708259117134, statistic_sign=1)
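(For reference, this is essentially SciPy's two-sample KS test; a self-contained sketch with placeholder arrays standing in for my real targets:)

```python
# Sketch of the two-sample KS test above; y_train / y_val are random
# placeholders standing in for my real 342k / 62k arrays.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
y_train = rng.normal(size=342_000)
y_val = rng.normal(size=62_000)

print(ks_2samp(y_train, y_val))  # prints a KstestResult like the one above
```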
Whatever I try, the validation loss starts below the training loss, as shown in the graphs below (val_loss is shifted by 0.5 epochs to account for train_loss being averaged over the whole epoch while val_loss is computed at its end; see the plotting sketch after the list):
1st: Adadelta and mae
2nd: Adam and mse
3rd: Adadelta and mse
4th: Adam and mse + L1 regularisation and dropout
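For reference, this is roughly how the shifted curves are produced (a sketch with placeholder numbers; in my runs the two lists come from history.history["loss"] and history.history["val_loss"] returned by model.fit):

```python
import matplotlib.pyplot as plt

# Placeholder curves; the real ones come from model.fit()'s History.
train_loss = [0.90, 0.60, 0.50, 0.45, 0.42]
val_loss = [0.70, 0.55, 0.48, 0.44, 0.43]
epochs = range(1, len(train_loss) + 1)

plt.plot(epochs, train_loss, label="train_loss")
# train_loss is averaged while the weights are still changing, so it
# effectively reflects the model ~half an epoch earlier; shifting
# val_loss by 0.5 puts both curves on a comparable time axis.
plt.plot([e + 0.5 for e in epochs], val_loss, label="val_loss (shifted)")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```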

All the models have very similar RMSE and accuracy and are almost usable, with the 3rd one performing the best of the bunch. I'm curious why the val_loss always starts below the training loss, and whether this is a problem, since it would technically mean the model fits the validation data better than the training data from the start :-? Also, what could I do to investigate the issue further? Any general advice is welcome.
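For what it's worth, one check I was considering (a self-contained toy, not my real data or model) is to compare the loss Keras reports during fit() with a clean evaluate() on the same training data, since the fit() number is a running average taken with dropout active, while evaluate() uses the final weights in inference mode:

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
x = rng.normal(size=(4096, 16)).astype("float32")
y = x.sum(axis=1, keepdims=True) + rng.normal(scale=0.1, size=(4096, 1))

model = keras.Sequential([
    keras.Input(shape=(16,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

history = model.fit(x, y, epochs=3, verbose=0)
# The fit()-reported loss should come out higher than the evaluate()
# loss here, because of the running average and the active dropout.
print("fit()-reported train loss:", history.history["loss"][-1])
print("evaluate() train loss:    ", model.evaluate(x, y, verbose=0))
```

Thanks in advance for the replies!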