
Two of the most popular initialization schemes for neural network weights today are Xavier and He. Both draw the weights at random with a variance that depends on the number of input and/or output units of the layer. Xavier proposes

$$W \sim \mathcal{U}\Bigg[-\frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}},\frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}\Bigg]$$

for networks with the $\text{tanh}$ activation function, and He proposes

$$W \sim \mathcal{N}(0,\sqrt{2/n_{in}})$$

for the $\text{ReLU}$ activation. Both initialization schemes are implemented in the most commonly used deep learning libraries for Python, PyTorch and TensorFlow.
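For reference, here is a minimal PyTorch sketch of the four variants; the layer sizes are arbitrary placeholders, and in practice you would apply only one initializer per weight tensor:

```python
import torch.nn as nn

# Two arbitrary example layers (sizes are placeholders).
layer_tanh = nn.Linear(256, 128)
layer_relu = nn.Linear(256, 128)

# Xavier/Glorot: scaled by n_in + n_out (pick one variant in practice).
nn.init.xavier_uniform_(layer_tanh.weight)  # U[-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))]
nn.init.xavier_normal_(layer_tanh.weight)   # N(0, 2/(n_in+n_out))

# He/Kaiming: scaled by n_in (fan_in), with the ReLU gain sqrt(2).
nn.init.kaiming_uniform_(layer_relu.weight, nonlinearity='relu')  # U[-sqrt(6/n_in), +sqrt(6/n_in)]
nn.init.kaiming_normal_(layer_relu.weight, nonlinearity='relu')   # N(0, 2/n_in)
```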
However, both libraries offer a normal and a uniform variant of each scheme. The main argument of both papers concerns the variance of the signal at initialization time (which depends on the non-linearity) and the requirement that it stay constant across all layers when back-propagating. I see how one can adjust the bounds $[-a,a]$ of a uniform variable so that it has the desired standard deviation, and vice versa ($\sigma = a/\sqrt{3}$), but I'm not sure why we need both a normal and a uniform version of each scheme. Wouldn't it be enough to have only normal or only uniform variants? Or uniform Xavier and normal He, as proposed in their papers?
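To make the variance matching concrete, here is a small NumPy sketch (the layer size is arbitrary) checking that the Xavier uniform bound $a=\sqrt{6/(n_{in}+n_{out})}$ and the Xavier normal standard deviation $\sigma=\sqrt{2/(n_{in}+n_{out})}$ satisfy exactly $\sigma = a/\sqrt{3}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128                      # arbitrary layer size

a = np.sqrt(6.0 / (n_in + n_out))           # Xavier uniform bound
sigma = np.sqrt(2.0 / (n_in + n_out))       # Xavier normal std
print(np.isclose(sigma, a / np.sqrt(3)))    # True: same variance by construction

# Empirically, both variants yield weights with (approximately) the same std.
w_uni = rng.uniform(-a, a, size=(n_out, n_in))
w_nrm = rng.normal(0.0, sigma, size=(n_out, n_in))
print(round(w_uni.std(), 4), round(w_nrm.std(), 4))   # both close to sigma ~ 0.0722
```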

I can imagine that uniform distributions are easier to sample from a computational point of view, but since the initialization is done only once at the beginning, its cost is negligible compared to that of training. Furthermore, uniform variables are bounded, so there are no long-tail observations as one would expect with a normal; I suppose that's why both libraries also offer truncated normal initializations.
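Indeed, both libraries expose truncated variants. A minimal PyTorch sketch of a truncated-normal, He-style initialization (the layer size and the $\pm 2\sigma$ cutoff are just example choices; TensorFlow's tf.keras.initializers.TruncatedNormal behaves similarly):

```python
import math
import torch.nn as nn

layer = nn.Linear(256, 128)                   # arbitrary example layer
fan_in = layer.weight.shape[1]
std = math.sqrt(2.0 / fan_in)                 # He-style std for ReLU

# Truncated normal: all samples are confined to [-2*std, 2*std],
# so the long tail of the normal never makes it into the initial weights.
nn.init.trunc_normal_(layer.weight, mean=0.0, std=std, a=-2 * std, b=2 * std)
print(bool(layer.weight.abs().max() <= 2 * std))   # True
```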

Are there any theoretical, computational or empirical justifications for when to use a normal over a uniform, or a uniform over a normal weight initialization regardless of the final weight variance?

Tinu

1 Answer


Are there any theoretical, computational or empirical justifications for when to use a normal over a uniform, or a uniform over a normal weight initialization regardless of the final weight variance?

There is no hard evidence that one initialization works better than the other in certain cases. It seems to be down to experience and empirical results. As you mentioned, there are no computational differences in the initialization itself, and even if there were, it would be minuscule compared to the training itself.

Even though I typically don't expect the initialization to impact the end results dramatically, here is my rule of thumb when picking uniform vs normal:

  • Uniform: when the architecture is small/compact and I expect all of its weights to "contribute", so typically smaller architectures.

The reasoning is this: suppose the "optimal" model has all of its weights non-zero (i.e. all weights are important). When you initialize uniformly, the "distance" between the initial weights and the "ideal" weights will be somewhat shorter, so you'd hope to converge more quickly.

  • Normal: when my architecture is likely "bigger than needed", so typically larger architectures.

Similarly, if I expect my "ideal" model to have relatively sparse weights (i.e. a lot of them are zero), I'd like my initial weights to be as close to that as possible, and a normal with the same variance puts more of its mass near zero than a uniform does (see the toy sketch after this list).
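To illustrate that intuition (a toy Monte Carlo comparison, not a claim about real training dynamics): with the variance matched, a uniform initialization sits on average marginally closer to a dense target, while a normal one sits noticeably closer to a sparse target. The sample size, std, and sparsity level below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 1_000_000, 0.1                  # arbitrary sample size and std
a = np.sqrt(3) * sigma                     # uniform bound with matching variance

w_uni = rng.uniform(-a, a, size=n)
w_nrm = rng.normal(0.0, sigma, size=n)

# "Dense" target: every weight matters (all entries are +/- sigma).
dense = sigma * rng.choice([-1.0, 1.0], size=n)
# "Sparse" target: ~90% of the ideal weights are exactly zero.
sparse = np.where(rng.random(n) < 0.1, dense, 0.0)

for name, target in [("dense ", dense), ("sparse", sparse)]:
    print(name,
          "uniform:", round(float(np.abs(w_uni - target).mean()), 4),
          "normal:", round(float(np.abs(w_nrm - target).mean()), 4))
# dense : uniform ~0.115 vs normal ~0.117  (uniform marginally closer)
# sparse: uniform ~0.089 vs normal ~0.083  (normal closer)
```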

So, in short, your initialization might impact how quickly you converge and, therefore, "where" you converge. How "sparse" your converged network ends up does impact the model's complexity and robustness, but if you do end up with a really sparse network, then perhaps your architecture could have been made much smaller in the first place.