24

Adding BatchNorm layers speeds up training and makes the whole deep model more stable. That's an experimental fact that is widely exploited in machine-learning practice.

My question is - why does it work?

The original (2015) paper motivated the introduction of the layers by stating that they help fix "internal covariate shift". The rough idea is that large shifts in the distributions of inner layers' inputs make training less stable, forcing a decrease in the learning rate and slowing down training. Batch normalization mitigates this problem by standardizing the inputs of inner layers.
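(For concreteness, a minimal PyTorch sketch of my own, not from the paper, of what "standardizing the inputs of inner layers" means: a BatchNorm layer between the linear map and the nonlinearity re-centers and re-scales each unit's pre-activations over the mini-batch.)

```python
import torch

# Sketch only: BatchNorm1d standardizes each unit's pre-activations over the mini-batch.
layer = torch.nn.Sequential(
    torch.nn.Linear(20, 50),
    torch.nn.BatchNorm1d(50),   # per-unit standardization over the batch
    torch.nn.ReLU(),
)

x = torch.randn(128, 20)
h = layer[1](layer[0](x))       # pre-activations right after the BatchNorm step
print(h.mean(0).abs().max().item(), h.std(0).mean().item())   # ~0 means, ~1 stds per unit
```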

This explanation was harshly criticized by a follow-up (2018) paper -- quoting the abstract:

... distributional stability of layer inputs has little to do with the success of BatchNorm

They demonstrate that BatchNorm only slightly affects the distributions of inner-layer inputs. More than that -- they injected non-zero-mean/variance noise into those distributions and still got almost the same performance.

Their conclusion was that the real reason BatchNorm works was that...

Instead BatchNorm makes the optimization landscape significantly smoother.

Which, to my taste, is close to tautological with saying that it improves stability.

I've found two more papers trying to tackle the question: In this paper the "key benefit" is claimed to be the fact that Batch Normalization biases residual blocks towards the identity function. And in this paper that it "avoids rank collapse".

So, is there any bottom line? Why does BatchNorm work?

nbro
Kostya

5 Answers

4

I believe anything in machine learning that works, works because it flattens and smooths the loss landscape.

Batch and layer normalization help ensure that the feature vectors (i.e. channels) are embedded around the unit sphere: batch/instance norm translates them to the origin, and layer norm scales them radially onto the unit sphere. Viewing neural networks as transformations, this would make the loss landscape smoother, since the transformations the neural net needs to find become more "regular".
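As a rough illustration of that geometric picture (my own toy sketch, not from the answer's sources): batch/instance-norm-style statistics translate each feature to the origin, while layer-norm-style statistics put each sample on a sphere of fixed radius.

```python
import numpy as np

# Toy sketch of the geometric view above (illustration only, not a full BN/LN layer).
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(128, 16))  # (batch, features), shifted and scaled

# Batch/instance-norm-style statistics: per feature, over the batch dimension.
bn = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-5)
print(bn.mean(axis=0).round(3))              # ~0 for every feature: translated to the origin
print(bn.std(axis=0).round(3))               # ~1 for every feature

# Layer-norm-style statistics: per sample, over the feature dimension.
ln = (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + 1e-5)
print(np.linalg.norm(ln, axis=1).round(3))   # ~sqrt(16) = 4 for every sample: fixed radius
```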

I would recommend this video to learn about loss landscapes.

From Visualizing the Loss Landscape of Neural Nets (NeurIPS 2018).

Tom Huntington
0

When we train a deep neural network, the gradient tells us how to update each parameter under the assumption that the other layers do not change. In practice, we update all the layers simultaneously. Unexpected results can then happen, because many functions composed together are changed at the same time, using updates that were computed under the assumption that the other functions remain constant. This makes it very hard to choose an appropriate learning rate, because the effect of an update to the parameters of one layer depends strongly on all the other layers.

How does Batch Normalisation help: Batch Normalisation is a layer which can be added to the input or to any hidden layer of the neural network. Suppose H is the minibatch of activations of the layer to normalize.

The formula for normalizing H is:

$H' = \frac{H - \mu}{\sigma}$

where $\mu$ is a vector containing the mean of each unit and $\sigma$ is a vector containing the standard deviation of each unit.

At training time, $\mu$ and $\sigma$ are computed from the minibatch, and we back-propagate through the normalization operation itself. This means the gradient will never propose an update that acts simply to increase the mean or standard deviation of the activations: the normalization removes the effect of such an action and zeroes out its component in the gradient. Batch Normalisation thus ensures little or no covariate shift in the input to the layers that follow it, improving learning speed, as shown in the original paper mentioned in the question.
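A quick autograd check of that claim (my own sketch, with arbitrary downstream weights): because the batch mean is subtracted inside the normalization, a uniform shift of the whole minibatch cannot change the normalized activations, so the gradient flowing back to the activations sums to roughly zero across the batch.

```python
import torch

# Sketch: the gradient through the normalization has no "raise the batch mean" component.
torch.manual_seed(0)
h = torch.randn(64, requires_grad=True)                  # minibatch of activations for one unit
h_norm = (h - h.mean()) / (h.std(unbiased=False) + 1e-5)

w = torch.randn(64)                                      # arbitrary downstream weights
loss = (h_norm * w).sum()
loss.backward()

print(h.grad.sum())   # ~0: a uniform shift of the minibatch cannot change the loss
```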

For more details : https://www.deeplearningbook.org/contents/optimization.html

Ashwiniku918
0

This question got me thinking about my understanding of batch normalization. I thought I understood it until I read it; then I went back to the Coursera Deep Learning Specialization by Andrew Ng.
Prof. Andrew Ng explains it this way.


One reason batch norm works is that it normalizes not only the input features but also the values in the hidden units, so that they take on a similar range of values, which can speed up learning.

The second reason batch norm works is that it makes the weights in later (deeper) layers of the network, say the weights of layer 10, more robust to changes in the weights of earlier layers (e.g. layer 1). Those hidden-unit values are changing all the time, so the deeper layers suffer from the problem of covariate shift. What batch norm does is reduce the amount by which the distribution of these hidden-unit values shifts around. It ensures that no matter how the parameters of the network update, the mean and variance of those hidden-unit values will at least stay the same, which makes the inputs to the later layers more stable and gives them firmer ground to stand on.

And even though the input distribution changes a bit, it changes less; so even as the earlier layers keep learning, the amount by which the later layers are forced to adapt when the earlier layers change is reduced. If you will, it weakens the coupling between what the early layers' parameters have to do and what the later layers' parameters have to do. This allows each layer of the network to learn a little more independently of the other layers, which has the effect of speeding up learning in the whole network.

The takeaway is that, especially from the perspective of one of the later layers of the neural network, the earlier layers don't get to shift around as much, because they're constrained to have the same mean and variance. This makes the job of learning in the later layers easier.

It turns out batch norm also has a second, slight regularization effect. One non-intuitive thing about batch norm is that the mean and variance are computed on just each mini-batch, as opposed to the entire data set, so they carry a little bit of noise, because they're computed on only, say, 64, 128, or 256 training examples. This noise comes from the fact that batch norm works with mini-batches.
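To make the "noisy statistics" point concrete, a small NumPy sketch of my own (arbitrary numbers): the per-batch mean jitters from mini-batch to mini-batch, and the jitter shrinks as the batch size grows.

```python
import numpy as np

# Sketch of the mini-batch noise in batch norm's statistics (illustration only).
rng = np.random.default_rng(0)
activations = rng.normal(size=100_000)   # stand-in for one unit's activations over the data set

for batch_size in (64, 256):
    usable = (len(activations) // batch_size) * batch_size
    batch_means = activations[:usable].reshape(-1, batch_size).mean(axis=1)
    # The per-batch mean is a noisy estimate of the true mean; spread ~ 1/sqrt(batch_size).
    print(batch_size, round(float(batch_means.std()), 4))
```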

kibromhft
-1

It is a question with no simple answer.

On one hand, BatchNormalization is unloved by some, who argue that it doesn't improve the accuracy of neural networks, or even biases them. On the other hand, it is highly recommended by others because it leads to better-trained models with a larger scope of predictions and less chance of overflow.

All I know for sure is that BN is really effective for image classification. In fact, as image categorization and classification have soared in recent years, and since BN is considered good practice in this field, it has spread to almost all DNNs.

Not only is BN not always used for the right purpose, it is often applied without taking into account several elements, such as:

  • The layers between which BN is applied
  • The initialization algorithms
  • The activation functions
  • etc.

For more computer-science literature "against" BN, I will let you look at the paper by H. Zhang et al., who trained a DNN without BN and got good results.

Some people use the gradient clipping technique (R. Pascanu et al.) instead of BN, in particular for RNNs.
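For reference, a minimal PyTorch sketch of that gradient-clipping alternative (the tiny model and the threshold here are placeholders of my own choosing):

```python
import torch

# Gradient clipping sketch: rescale gradients whose global norm exceeds max_norm.
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

opt.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip before the update
opt.step()
```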

I hope this gives you some answers!

himiro
-2

To some extent, it gets rid of low-intensity numerical noise. The conditioning of the optimization problem is always an issue, and I suspect BatchNorm alleviates this instability.