I trained a neural network on an NLP problem and compared the validation loss and BLEU score with the same training parameters in two scenarios: a) when I trained on 25% of the data, and b) when I trained on 100% of the data. I observed a peculiar pattern: the validation loss and BLEU score are both much lower in b) than in a). What could this mean? My guess is that there are duplicates in the training data, which lead to overfitting, but I still have to investigate that.
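A quick sanity check I could run for the duplicate hypothesis, assuming the training data is available as a list of (source, target) sentence pairs (train_pairs is a hypothetical name):

    # train_pairs is a hypothetical list of (source, target) string tuples
    n_total = len(train_pairs)
    n_unique = len(set(train_pairs))
    print(f"{n_total - n_unique} of {n_total} training pairs are exact duplicates")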
1 Answer
As you pointed out, duplicates are one possible reason for this behavior. There are a few more possibilities:
- Class imbalance: the data is skewed towards a particular class (if you are solving a classification problem). A quick way to inspect the label distribution is sketched after this list.
- The model may not be learning well on 100% of the data with the given parameters. Try changing the learning rate, for example with a step-decay schedule:

      # Inside the training loop: step-decay the learning rate
      if t == int(args.num_iter * 0.5):
          lr = 0.1 * lr   # e.g. decay by 10x at the halfway point
      if t == int(args.num_iter * 0.75):
          lr = 0.1 * lr   # and again at 75% of the iterations
- It may be that the model you selected does not have enough capacity to fit the data well when the input data is large. Try changing the model and see if some other model works better.
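For the class-imbalance point above, counting the labels is usually enough. A minimal sketch, assuming the labels are available as a list called train_labels (a hypothetical name):

    from collections import Counter

    # train_labels is a hypothetical list of class labels, one per example
    label_counts = Counter(train_labels)
    total = sum(label_counts.values())
    for label, n in label_counts.most_common():
        print(f"{label}: {n} examples ({100 * n / total:.1f}%)")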
I hope this helps.
Nityanand Mathur