
I am training a one-layer unidirectional vanilla GRU on a next-item prediction task, using the last 10 interacted items as input. In my original experiment, where I trained on approx. 5.5M samples and validated on around 1M samples, I saw periodic fluctuations in my loss/accuracy curves. I have now run a few experiments on a far smaller dataset (train: 250K, val: 10K) and compared different learning rates. I still see this effect and would like to understand what causes it, and whether it is an issue or not.

I evaluate every 25% of an epoch, so 4 times per epoch. I accumulate the loss and the hit ratio over the training batches processed since the last evaluation; when it is time to evaluate, I compute the train metrics from these accumulators and then compute the validation metrics on the entire validation set.

I am using Adam as the optimizer.
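
Roughly, my setup looks like the sketch below (simplified; the model sizes, the dummy data, and helper names such as `hit_ratio_at_10` are illustrative placeholders, not my exact code):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Simplified sketch of the setup; sizes and data are dummies, not the real dataset.
num_items = 1000
seqs = torch.randint(0, num_items, (2000, 10))      # sessions of 10 interacted items
targets = torch.randint(0, num_items, (2000,))      # next item to predict
train_loader = DataLoader(TensorDataset(seqs, targets), batch_size=64, shuffle=True)

class NextItemGRU(nn.Module):
    def __init__(self, num_items, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(num_items, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, num_layers=1, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_items)

    def forward(self, seq):                          # seq: (batch, 10) item ids
        h, _ = self.gru(self.emb(seq))               # (batch, 10, hidden)
        return self.out(h[:, -1])                    # logits for the next item

def hit_ratio_at_10(logits, target):
    top10 = logits.topk(10, dim=-1).indices
    return (top10 == target.unsqueeze(-1)).any(-1).float().mean().item()

model = NextItemGRU(num_items)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

eval_every = len(train_loader) // 4                  # evaluate 4 times per epoch
running_loss, running_hr, n_batches = 0.0, 0.0, 0

for epoch in range(3):
    for i, (seq, target) in enumerate(train_loader, 1):
        optimizer.zero_grad()
        logits = model(seq)
        loss = criterion(logits, target)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        running_hr += hit_ratio_at_10(logits, target)
        n_batches += 1

        if i % eval_every == 0:
            train_loss = running_loss / n_batches    # train metrics since last eval
            train_hr = running_hr / n_batches
            running_loss, running_hr, n_batches = 0.0, 0.0, 0
            print(f"epoch {epoch}: train loss {train_loss:.3f}, train HR@10 {train_hr:.3f}")
            # ... compute loss and HR@10 on the entire validation set here ...
```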

[Plots: HR@10 and cross-entropy loss for train and validation, at several learning rates]

The plots show HR@10 (= accuracy for top-10 predictions) and cross-entropy loss for train and validation. One step corresponds to 25% of an epoch, so step 0 = 25% of the first epoch and step 4 = 25% of the second epoch. You can see that the phenomenon gets more extreme for higher learning rates. The spike in training HR/loss always happens at the first evaluation of an epoch (i.e., at 25%). It is paired with a drop in validation accuracy, which then improves over the next three steps (50%, 75%, 100% of the epoch) and falls again when the spike in training accuracy recurs.

Things I have already ruled out:

  • the data is fed in random order, so the batches are not always the same
  • gradients being accidentally left unreset before calling backward() (they are zeroed correctly)
  • a wrong loss/accuracy calculation caused by dividing by the wrong number of batches

I would appreciate any help in understanding what is happening with the model. It still seems to work, but I just don't understand what causes this behavior.

PatrickSVM

1 Answer


An oscillating loss is a symptom of divergent training: it can be caused by large gradient updates and/or numerical instability. Moreover, you said that it gets worse with a higher LR, which supports this diagnosis.

To fix that, you could try the following (a PyTorch sketch covering all three suggestions follows the list):

  • A decaying LR schedule: after a given number of optimization steps (or epochs), the LR is reduced by a factor $\gamma$ ($\gamma < 1$) until a minimum LR is reached. The decay can be linear (so the reduction is proportional to the number of steps) or exponential, for example.
  • Layer normalization: this can help stabilize the gradients and speed up training.
  • Gradient clipping: it can rescale the gradients individually (per-layer) or globally so that their norm does not exceed a given threshold. Basically, it reduces the magnitude of the gradients to control the size of the update.
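
For example, here is a minimal sketch of how all three could be wired together in PyTorch (the model and the dummy data are placeholders, not the code from the question; `StepLR` and the clipping threshold of 1.0 are just illustrative choices):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy data only so the sketch runs; replace with your own DataLoader.
seqs = torch.randint(0, 1000, (512, 10))
targets = torch.randint(0, 1000, (512,))
train_loader = DataLoader(TensorDataset(seqs, targets), batch_size=64, shuffle=True)

class GRUWithLayerNorm(nn.Module):
    def __init__(self, num_items=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(num_items, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)   # layer norm on the GRU output
        self.out = nn.Linear(hidden_dim, num_items)

    def forward(self, seq):
        h, _ = self.gru(self.emb(seq))
        return self.out(self.norm(h[:, -1]))   # logits for the next item

model = GRUWithLayerNorm()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Step decay: multiply the LR by gamma < 1 every `step_size` epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

for epoch in range(5):
    for seq, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(seq), target)
        loss.backward()
        # Global clipping: rescale all gradients so their total L2 norm is <= 1.0.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()  # decay the learning rate once per epoch
```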

I also suggest monitoring the (Euclidean) norm of both the gradients and the weights, so that you can tell whether the gradients are large, exploding, or vanishing, and also keep an eye on possible overfitting.
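
A small helper along these lines could look as follows (a sketch assuming a standard PyTorch model; call it after `backward()` and before `optimizer.step()`):

```python
import torch

def global_norms(model: torch.nn.Module):
    """Euclidean (L2) norm of all gradients and of all weights, for logging."""
    grad_sq, weight_sq = 0.0, 0.0
    for p in model.parameters():
        weight_sq += p.detach().float().pow(2).sum().item()
        if p.grad is not None:
            grad_sq += p.grad.detach().float().pow(2).sum().item()
    return grad_sq ** 0.5, weight_sq ** 0.5

# Inside the training loop, right after loss.backward():
#   grad_norm, weight_norm = global_norms(model)
# Log both every N batches (e.g. to TensorBoard) and plot them next to the loss.
```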

Update: numerical instability in the loss computation can lead to large gradients and even NaNs. In particular, the recommended way to use the cross-entropy loss is from logits (i.e. the unnormalized log-probabilities of the classes), since this can be implemented in a numerically stable way; when applied directly to a softmax output, it can be unstable.
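
In PyTorch terms, for example (random tensors just for illustration):

```python
import torch
import torch.nn as nn

logits = torch.randn(32, 1000)            # raw model outputs, no softmax applied
targets = torch.randint(0, 1000, (32,))

# Recommended: CrossEntropyLoss takes raw logits and applies log-softmax
# internally in a numerically stable way.
stable_loss = nn.CrossEntropyLoss()(logits, targets)

# Risky pattern: explicit softmax followed by log + NLL. The probabilities can
# underflow to 0, so the log produces -inf and the gradients can become NaN.
probs = torch.softmax(logits, dim=-1)
unstable_loss = nn.NLLLoss()(torch.log(probs), targets)
```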

Luca Anzalone