Indeed, large batch sizes give smoother, more deterministic gradient estimates during training, but they also tend to converge to sharp minima, which often generalize worse. If normalizing gradients did not help, consider adjusting the learning rate first. Larger batch sizes usually require careful tuning of it: a learning rate that works well for small mini-batches may cause instability with larger batches, leading to slow convergence or oscillatory behavior. A common heuristic in deep learning is to scale the learning rate linearly with the batch size. For very large batches, LARS is a suitable layer-wise adaptive learning-rate optimizer.
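As a minimal illustration of the linear scaling heuristic (the baseline learning rate and batch sizes below are made-up example values, not taken from your setup):

```python
base_lr = 0.1          # learning rate tuned for the baseline batch size
base_batch_size = 128  # batch size the baseline learning rate was tuned for
new_batch_size = 1024  # the larger batch size you want to train with

# Scale the learning rate proportionally to the batch size increase.
scaled_lr = base_lr * (new_batch_size / base_batch_size)
print(scaled_lr)  # 0.8
```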
LARS (Layer-wise Adaptive Rate Scaling) is an optimizer designed to accelerate training with large batch sizes. Instead of a single global rate (or a rate per parameter), it assigns a separate learning rate to each layer, computed from a trust ratio between the layer's weight norm and gradient norm. This keeps the update size stable across layers, as sketched below.
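A minimal PyTorch-style sketch of the layer-wise update, treating each parameter tensor as one "layer" and using illustrative hyperparameter values (this is a simplified sketch, not the reference LARS implementation):

```python
import torch

def lars_step(params, base_lr=0.1, eta=0.001, weight_decay=1e-4):
    """Apply one LARS-style update to each parameter tensor (illustrative sketch)."""
    with torch.no_grad():
        for w in params:
            if w.grad is None:
                continue
            w_norm = torch.norm(w)
            g_norm = torch.norm(w.grad)
            if w_norm > 0 and g_norm > 0:
                # Layer-wise trust ratio: eta * ||w|| / (||g|| + wd * ||w||)
                trust_ratio = eta * w_norm / (g_norm + weight_decay * w_norm)
            else:
                trust_ratio = torch.tensor(1.0)
            # Scale the (weight-decayed) gradient step by the per-layer trust ratio.
            w -= base_lr * trust_ratio * (w.grad + weight_decay * w)
```

You would call something like `lars_step(model.parameters())` after `loss.backward()` in place of a plain SGD step; in practice you would use a tested LARS implementation rather than this sketch.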
If monitoring gradient statistics during training shows that gradients are exploding, gradient clipping might help maintain stability. Large batches can produce large gradient updates if the learning rate is not adjusted accordingly.
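A sketch of a training step that logs the gradient norm and clips it, assuming PyTorch and hypothetical `model`, `criterion`, `inputs`, `targets`, and `optimizer` objects from your training loop:

```python
import torch

outputs = model(inputs)
loss = criterion(outputs, targets)

optimizer.zero_grad()
loss.backward()

# Clip the global gradient norm (max_norm=1.0 is illustrative) and log the
# pre-clipping norm so exploding gradients are visible during training.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"gradient norm before clipping: {total_norm.item():.3f}")

optimizer.step()
```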
For larger and deeper networks, adding batch normalization with your existing batch sizes (64 or 128) can help stabilize training and speed up convergence to a good minimum.
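A sketch of how batch normalization layers might be inserted into a PyTorch feed-forward network; the layer sizes here are placeholders, not taken from your model:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # normalizes activations over each (64 or 128) batch
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
```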
Finally, you can always fall back to smaller mini-batches if large-batch training remains unstable.