I'm working with a fully connected neural network with input 32x32x3.
The architecture is: a dense layer with 32 units + ReLU activation, then a dense layer with 64 units + ReLU activation, followed by a dense layer with 32 units + ReLU activation, and finally a dense layer of 10 neurons with softmax activation.
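For reference, this is roughly how I have it set up (a Keras sketch; the layer sizes are the ones above, everything else is just illustrative, and `weight_init` stands for the initializer the homework question asks about):

```python
import tensorflow as tf
from tensorflow.keras import layers, initializers

def build_model(weight_init):
    # weight_init is e.g. initializers.RandomUniform(0.0, 1.0)
    #                 or  initializers.RandomUniform(-1.0, 0.0)
    return tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),
        layers.Flatten(),  # 32*32*3 = 3072 input features
        layers.Dense(32, activation="relu", kernel_initializer=weight_init),
        layers.Dense(64, activation="relu", kernel_initializer=weight_init),
        layers.Dense(32, activation="relu", kernel_initializer=weight_init),
        layers.Dense(10, activation="softmax", kernel_initializer=weight_init),
    ])
```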
I have a homework question:
If we had to choose between initializing the weights from Uniform(0,1) and from Uniform(-1,0), which one would you expect to work best, and why?
After searching the internet and asking ChatGPT, I have ended up with two conclusions that contradict each other:
1- All my inputs are positive, so if I choose Uniform(0,1), every pre-activation z = W·x + b will be > 0, and ReLU then acts as a purely linear (identity) function. Therefore Uniform(0,1) is the worse choice.
2- All my inputs are positive, so if I choose Uniform(-1,0), every pre-activation z will be < 0, and most of my neurons will be deactivated (output exactly 0) after ReLU. Therefore Uniform(-1,0) is the worse choice. (A small numerical sketch illustrating both cases follows this list.)
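To make the two arguments concrete, here is the sketch I mentioned (plain NumPy; the zero biases, inputs drawn uniformly from [0,1], and the 3072→32 first layer are my own assumptions, not part of the homework):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: 32*32*3 = 3072 positive inputs feeding the first dense layer of 32 units.
x = rng.uniform(0.0, 1.0, size=3072)  # all inputs positive (e.g. normalized pixels)

for name, low, high in [("Uniform(0,1)", 0.0, 1.0), ("Uniform(-1,0)", -1.0, 0.0)]:
    W = rng.uniform(low, high, size=(32, 3072))  # weight initialization under test
    b = np.zeros(32)                             # assuming zero biases at initialization
    z = W @ x + b                                # pre-activations of the first layer
    a = np.maximum(z, 0.0)                       # ReLU
    print(f"{name}: fraction of z > 0 = {np.mean(z > 0):.2f}, "
          f"fraction of dead units = {np.mean(a == 0):.2f}")
```

Since every term in W·x has the same sign as the weights, Uniform(0,1) makes every z positive (so ReLU behaves like the identity) and Uniform(-1,0) makes every z negative (so every unit outputs 0), which is exactly the tension between the two arguments above.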
I don't know which of these (if either) is correct, and I would appreciate help clarifying it.