
In this lecture, the professor says that one problem with the sigmoid function is that its outputs aren't zero-centered. The explanation she gives for why this is bad is that the gradient of the loss with respect to the weights, $\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \sigma}\frac{\partial \sigma}{\partial w}$, will always be either all positive or all negative, so we'll have a problem updating our weights: as she shows in the slide, we can't move directly in the direction of the vector $(1,-1)$. I don't understand why, since she only talks about one component of the gradient and not the whole vector. If the components of the gradient can have different signs, then we can still adjust the weights in different directions along different axes. Am I wrong? The other thing I don't understand is how this property generalizes to non-zero-centered activation functions and non-zero-centered data.
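To spell out the computation I think the slide is doing (this is just my reading, assuming a single neuron $\sigma(z)$ with $z = w^{T}x + b$ whose inputs $x_i$ are themselves sigmoid outputs and hence all positive):

$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial \sigma}\,\sigma'(z)\,x_i.$$

Since $\sigma'(z) > 0$ and $x_i > 0$, all of these components share the sign of the single scalar $\frac{\partial L}{\partial \sigma}$; but that seems to apply only to one neuron's incoming weights, not to the whole gradient of the network.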

Daviiid

1 Answer


Yes. If the activation function of the network is not zero-centered, $y = f(x^{T}w)$ is always positive or always negative. Consequently, the outputs of a layer are always shifted towards either positive or negative values, and the inputs to the next layer all share the same sign. Because those inputs share a sign, the gradient components for a given neuron's incoming weights also all share a sign, so the weight vector needs more update steps (a zig-zag path) to be trained properly, and the number of epochs needed for the network to get trained also increases. This is why the zero-centered property is important, though it is NOT necessary.
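Here is a minimal numeric sketch of that sign constraint (my own illustration, not taken from the lecture or the survey): with all-positive inputs, each per-weight gradient is the same scalar times $x_i$, so every component shares one sign and a single gradient step can never point in a direction like $(1,-1)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inputs that came from a previous sigmoid layer: all strictly positive.
x = rng.uniform(0.1, 0.9, size=5)   # x_i > 0 for every i
w = rng.normal(size=5)
b = 0.0

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = w @ x + b
a = sigmoid(z)

# Suppose the loss gradient w.r.t. this neuron's activation is some scalar dL/da.
dL_da = -0.7                         # could equally be positive

# Chain rule: dL/dw_i = dL/da * sigma'(z) * x_i
dL_dw = dL_da * a * (1 - a) * x

print(dL_dw)              # every component has the sign of dL_da
print(np.sign(dL_dw))     # all -1 here (all +1 if dL_da > 0)
# One gradient step therefore moves every w_i in the same direction; reaching a
# point that needs w_1 to increase and w_2 to decrease takes a zig-zag of steps.
```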

Zero-centered activation functions ensure that the mean activation value is around zero. This property is important in deep learning because it has been empirically shown that models operating on normalized data, whether inputs or latent activations, enjoy faster convergence.
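A quick way to see the mean-activation point (a sketch of my own, not from the survey): feed the same roughly zero-mean pre-activations through sigmoid and tanh and compare the mean outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)         # roughly zero-mean pre-activations

sigmoid = 1.0 / (1.0 + np.exp(-z))
tanh = np.tanh(z)

print(sigmoid.mean())   # ~0.5: sigmoid outputs are never zero-centered
print(tanh.mean())      # ~0.0: tanh keeps activations roughly zero-centered
```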

Unfortunately, zero-centered activation functions like tanh saturate at their asymptotes: within those regions the gradients become vanishingly small, leading to a weak training signal.
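A small sketch of that saturation (again my own illustration): the derivative of tanh is $1 - \tanh^2(z)$, which is essentially zero once $|z|$ is more than a few units, so almost no gradient flows back through a saturated unit.

```python
import numpy as np

for z in [0.0, 1.0, 2.5, 5.0, 10.0]:
    grad = 1.0 - np.tanh(z) ** 2      # derivative of tanh at z
    print(f"z = {z:5.1f}   d tanh/dz = {grad:.2e}")
# At z = 5 the derivative is already ~1.8e-04, and at z = 10 it is ~8.2e-09,
# so upstream gradients are multiplied by almost nothing.
```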

ReLU avoids this saturation problem (for positive inputs), but it is not zero-centered either. So all-positive (or all-negative) activation functions, whether sigmoid or ReLU, can be difficult for gradient-based optimization. To solve this problem, deep learning practitioners have invented a myriad of normalization layers (batch norm, layer norm, weight norm, etc.) that re-center the activations to be roughly zero-mean before they reach the next layer.
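As a sketch of what such layers do (a hand-rolled, layer-norm-style normalization in NumPy, not any particular framework's implementation):

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Re-center and re-scale each row of activations to zero mean, unit variance."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
h = np.maximum(rng.normal(size=(4, 8)), 0.0)   # ReLU activations: all >= 0

h_norm = layer_norm(h)
print(h.mean(axis=-1))        # strictly positive means
print(h_norm.mean(axis=-1))   # ~0: the next layer sees zero-centered inputs
```

(A real layer-norm or batch-norm layer also learns a per-feature gain and bias, which are omitted here for brevity.)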

Reference:

A Survey on Activation Functions and their relation with Xavier and He Normal Initialization

hanugm