
My knowledge

Suppose you have a layer that is fully connected, and that each neuron performs an operation like

a = g(w^T * x + b)

where a is the output of the neuron, x the input, g our generic activation function, and finally w and b our parameters.

If both w and b are initialized with all elements equal to each other, then a is equal for each unit of that layer.

This means that we have symmetry: at each iteration, whichever algorithm we choose to update our parameters will update them all in the same way, so there is no point in having multiple units, since they all behave as a single one.
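To make the symmetry concrete, here is a minimal NumPy sketch (the layer sizes, the sigmoid activation, and the unit upstream gradient are illustrative choices of mine): with all-equal parameters, every unit computes the same activation, and the per-unit gradients of w and b come out identical as well.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_out = 4, 3
x = np.array([0.5, -1.0, 2.0, 0.1])   # an arbitrary input

# Symmetric initialization: every element of w and b is the same constant.
w = np.full((n_out, n_in), 0.5)
b = np.full(n_out, 0.1)

a = sigmoid(w @ x + b)
print(a)  # all three units output the identical value

# With symmetric downstream weights, the gradient reaching each unit's
# activation is identical too; take it to be 1 for illustration.
grad_a = np.ones(n_out)
grad_z = grad_a * a * (1.0 - a)   # sigmoid derivative
grad_w = np.outer(grad_z, x)      # every row is identical
grad_b = grad_z                   # every element is identical
print(grad_w)  # identical rows: the units can never diverge
```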

In order to break the symmetry, we could randomly initialize the matrix w and initialize b to zero (this is the setup that I've seen most often). This way, a is different for each unit, so all the neurons behave differently.
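A minimal sketch of that common setup (the 0.01 scale is an arbitrary illustrative choice; principled scales are discussed in the answers below):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3

w = rng.standard_normal((n_out, n_in)) * 0.01  # small random weights break symmetry
b = np.zeros(n_out)                            # zero biases are fine once w is random
```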

Of course, randomly initializing both w and b would also be fine, even if it is not necessary.

Question

Is randomly initializing w the only choice? Could we randomly initialize b instead of w in order to break the symmetry? Is the answer dependent on the choice of the activation function and/or the cost function?

My thinking is that we could break the symmetry by randomly initializing b, since in this way a would be different for each unit and, since in the backward propagation the derivatives of both w and b depend on a (at least this should be true for all the activation functions that I have seen so far), each unit would behave differently. Obviously, this is only a thought, and I'm not sure that it is absolutely true.
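For what it's worth, a quick NumPy check of this idea (the sizes and the tanh activation are arbitrary choices of mine) does give a different a per unit, although the output no longer depends on x, since w is zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3
x = rng.standard_normal(n_in)

w = np.zeros((n_out, n_in))
b = rng.standard_normal(n_out)   # random biases, zero weights

a = np.tanh(w @ x + b)
print(a)  # distinct per-unit outputs, but the same for every input x
```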

gvgramazio

3 Answers


Randomising just b sort of works, but setting w to all zero causes severe problems with vanishing gradients, especially at the start of learning.

Using backpropagation, the gradient at the outputs of layer L is a sum obtained by multiplying the gradient at the inputs to layer L+1 by the weights (and not the biases) between the two layers. This sum is zero if the weights are all zero.

A gradient of zero at L's output will in turn cause all earlier layers (L-1, L-2, etc., all the way back to layer 1) to receive zero gradients, and thus to update neither weights nor biases at the update step. So the first time you run an update, it will only affect the last layer. The next time, it will affect the two layers closest to the output (but only marginally at the penultimate layer), and so on.
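Here is a small sketch of that effect in NumPy (a two-layer tanh network with a squared-error loss, written from scratch for illustration): with all-zero weights and random biases, the gradient reaching the first layer's parameters is exactly zero, so only the last layer moves on the first update.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
target = 1.0

# The setup under discussion: all-zero weights, random biases.
W1 = np.zeros((4, 5)); b1 = rng.standard_normal(4)
W2 = np.zeros((1, 4)); b2 = rng.standard_normal(1)

# Forward pass
h = np.tanh(W1 @ x + b1)
y = W2 @ h + b2

# Backward pass (squared-error loss)
grad_y = y - target
grad_W2 = np.outer(grad_y, h)   # nonzero: the output layer does learn
grad_h = W2.T @ grad_y          # zero, because W2 is all zeros
grad_z1 = grad_h * (1.0 - h ** 2)
grad_W1 = np.outer(grad_z1, x)  # all zeros
grad_b1 = grad_z1               # all zeros: layer 1 gets no update at all

print(grad_W2)  # nonzero
print(grad_W1)  # zeros: the gradient vanished before reaching layer 1
```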

A related issue is that a network whose weights are all zero, or all the same, maps all inputs, no matter how they vary, onto the same output. This can also adversely affect the gradient signal that you are using to drive learning: for a balanced data set, you have a good chance of starting learning close to a local minimum in the cost function.

For deep networks especially, to fight vanishing (or exploding) gradients, you should initialise the weights from a distribution whose expected output magnitude (after multiplying the inputs) and expected gradient magnitude neither vanish nor explode. Analysing which values work best in deep networks is how Xavier/Glorot initialisation was discovered. Without careful initialisation along these lines, deep networks take much longer to learn, or in the worst cases never recover from a poor start and fail to learn effectively.
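As a sketch, the uniform variant of Glorot/Xavier initialisation draws weights so that the activation and gradient variances stay roughly constant across layers (the limit below is the standard formula from Glorot & Bengio, 2010; the layer sizes are illustrative):

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng=None):
    """Xavier/Glorot uniform initialisation: Var(w) = 2 / (n_in + n_out)."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

W = glorot_uniform(256, 128)
b = np.zeros(128)  # biases stay at zero, as in the common setup above
```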

To avoid these problems, you could potentially try to find a good non-zero fixed value for the weights, as an alternative to Xavier initialisation, along with a good magnitude/distribution for bias initialisation. Both would vary according to the size of the layer and possibly the activation function. However, I suspect this would suffer from other issues, such as sampling bias: there are more weights than biases, so you get a better fit to the desired aggregate behaviour by setting all the weight values randomly than you would by setting only the biases randomly.

Neil Slater

Most of the explanations given for choosing something or not choosing something (like hyperparameter tuning) in deep learning are based on empirical studies, such as analysing the error over a number of iterations. So this answer is the kind of explanation that people on the deep learning side give.

Since you have asked for a mathematical explanation, I suggest you read the paper Convergence Analysis of Two-layer Neural Networks with ReLU Activation (2017, NIPS). It analyses the convergence of SGD to a global minimum, subject to the weights being initialised from a Gaussian distribution and ReLU being used as the activation function. The paper considers a neural net with no hidden layer, just input and output layers.

The very fact that an analysis of such a 'simple' network gets published at a highly reputed, top conference itself suggests that the explanation you are seeking is not easy, and that very few people work on the theoretical aspects of neural nets. IMHO, in a few years, as the research progresses, I might be able to edit this answer and give the explanation you sought. Until then, this is the best I can do.

varsh

w should be randomized to small (nonzero) numbers so that the adjustments made by backpropagation are more meaningful and each value in the matrix is updated by a different amount. If you start with all zeros, it will still work, but it will take longer to reach a meaningful result. AFAIK, this was found empirically by various researchers and became common practice.

Randomizing b does not help in the same way, so most people do not bother.

This choice is one of many made by the architect of the network; in theory, you could use an infinite number of initializations for the w matrix. The one commonly used just happens to be well tested and to generally work.

This video explains it better than I can: Lecture 8.4 — Neural Networks Representation | Model Representation-II — [Andrew Ng].

Joe S