
I know that gradient accumulation is (1) a way to reduce memory usage while still effectively training with a large batch size, and (2) a way to reduce the noise of the gradient compared to plain SGD, thus smoothing the training process.

However, I wonder what causes the noise in the gradient. Is it the random sampling of mini-batches from a large dataset, intrinsic noise in the dataset (e.g. wrong labels), or something else?

For a small dataset, is it okay to do stochastic gradient descent, and what might be the motivation for using gradient accumulation on it?

Cyrus

1 Answer


This image from here nicely illustrates how gradient accumulation is performed:

[Figure: gradient accumulation — gradients from several mini-batches are summed before a single weight update]

Assuming infinite memory and compute, we would be able to compute the gradient on the full batch; this would provide us with the true gradient!
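
To make this concrete (the notation below is my own addition, not part of the original answer): with $N$ training examples and per-example loss $\ell_i(\theta)$, the full-batch (true) gradient and a mini-batch estimate of it are

$$\nabla L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \nabla \ell_i(\theta), \qquad \hat{g}_{\text{mini}} = \frac{1}{m}\sum_{i \in B} \nabla \ell_i(\theta), \quad |B| = m \ll N.$$

The mini-batch estimate is unbiased, but the gap between $\hat{g}_{\text{mini}}$ and $\nabla L(\theta)$ is exactly the "noise" in question, and it shrinks as the batch size $m$ grows.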

In reality, the full-batch gradient is often not possible to compute, since holding the full dataset in memory is infeasible. The network weights are therefore updated several times per epoch using mini-batches. A mini-batch gradient is an approximation of the true gradient; it is noisy because it is computed on only a part of the dataset.

Alternatively, mini-batch gradients can be computed without updating the model immediately. Accumulating these gradients is referred to as gradient accumulation, and if one accumulates the gradients over all mini-batches, their (suitably normalized) sum equals the gradient obtained on the full batch. There is currently some debate about whether the changes to the loss landscape induced by mini-batching might actually be beneficial.
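
In code, the idea is simply to call `backward()` on several mini-batches before a single optimizer step. Below is a minimal PyTorch-style sketch; the toy data, the linear model, and the choice of `accum_steps = 4` are my own assumptions for illustration, not part of the original answer:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data and model purely for illustration
X = torch.randn(64, 10)
y = torch.randn(64, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=8)

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 4  # accumulate gradients over 4 mini-batches before each update

optimizer.zero_grad()
for step, (xb, yb) in enumerate(loader):
    loss = loss_fn(model(xb), yb)
    # Scale the loss so the accumulated gradient is the average over the
    # accumulated mini-batches, i.e. comparable to one larger batch.
    (loss / accum_steps).backward()   # .backward() sums gradients into .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # one weight update for accum_steps mini-batches
        optimizer.zero_grad()
```

Setting `accum_steps` equal to the number of mini-batches per epoch recovers (up to normalization) full-batch gradient descent; smaller values trade memory for more gradient noise.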

For further information, see this blog post on the trade-offs between full-batch and mini-batch learning.

Mariusmarten