
In a nutshell: I want to understand why a one-hidden-layer neural network converges to a good minimum more reliably when a larger number of hidden neurons is used. Below is a more detailed explanation of my experiment:

I am working on a simple 2D XOR-like classification example to better understand the effects of neural network initialisation. Here's a visualisation of the data and the desired decision boundary:

[figure: the four data blobs and the desired XOR-like decision boundary]

Each blob consists of 5000 data points. The minimal-complexity neural network that can solve this problem is a one-hidden-layer network with 2 hidden neurons. Since this architecture has the minimum number of parameters possible to solve this problem (with a NN), I would naively expect that it is also the easiest to optimise. However, this is not the case.
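For concreteness, here is a minimal sketch of the kind of data I mean; the exact blob centres and spread below are illustrative rather than the precise values I used:

```python
# Minimal sketch of an XOR-like dataset: four Gaussian blobs at the corners of
# a square, labelled by the XOR of the coordinate signs (illustrative values).
import numpy as np

def make_xor_blobs(n_per_blob=5000, spread=0.5, seed=0):
    rng = np.random.default_rng(seed)
    centers = np.array([[-2, -2], [-2, 2], [2, -2], [2, 2]], dtype=float)
    X, y = [], []
    for cx, cy in centers:
        X.append(rng.normal(loc=(cx, cy), scale=spread, size=(n_per_blob, 2)))
        # opposite corners share a class label
        y.append(np.full(n_per_blob, int((cx > 0) != (cy > 0))))
    return np.vstack(X), np.concatenate(y)

X, y = make_xor_blobs()
print(X.shape, y.shape)  # (20000, 2) (20000,)
```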

I found that with random initialisation this architecture converges only about half of the time, and whether it converges depends on the signs of the initial weights. Specifically, I observed the following behaviour:

w1 = [[1,-1],[-1,1]], w2 = [1,1] --> converges
w1 = [[1,1],[1,1]],   w2 = [1,-1] --> converges
w1 = [[1,1],[1,1]],   w2 = [1,1] --> finds only linear separation
w1 = [[1,-1],[-1,1]], w2 = [1,-1] --> finds only linear separation

This makes sense to me. In the latter two cases the optimisation gets stuck in suboptimal local minima. However, when increasing the number of hidden neurons to values greater than 2, the network develops a robustness to initialisation and starts to reliably converge for random values of w1 and w2. You can still find pathological examples, but with 4 hidden neurons the chance that at least one "pathway" through the network will have non-pathological weights is larger. But what happens to the rest of the network then, is it simply not used?

Does anybody understand where this robustness comes from, or can perhaps point me to some literature discussing this issue?

Some more information: this behaviour occurs in all training settings and architecture configurations I have investigated. For instance: activations=ReLU, final_activation=sigmoid, optimizer=Adam, learning_rate=0.1, cost_function=cross_entropy, with biases in both layers.
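To make the setup concrete, here is a hedged sketch of the 2-hidden-neuron experiment with the sign patterns listed above, reusing make_xor_blobs from the data sketch earlier. The weight magnitudes, the mapping of w1/w2 onto PyTorch's (out_features, in_features) layout, and the number of training steps are my assumptions; the sigmoid output and cross-entropy are folded into BCEWithLogitsLoss:

```python
import torch
import torch.nn as nn

def train_with_init(w1, w2, X, y, steps=500):
    # 2-2-1 network: ReLU hidden layer; the sigmoid output is folded into the loss.
    model = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))
    with torch.no_grad():
        model[0].weight.copy_(torch.tensor(w1, dtype=torch.float32))
        model[2].weight.copy_(torch.tensor(w2, dtype=torch.float32).reshape(1, 2))
    opt = torch.optim.Adam(model.parameters(), lr=0.1)
    loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy in one call
    Xt = torch.tensor(X, dtype=torch.float32)
    yt = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(Xt), yt).backward()
        opt.step()
    # a logit > 0 corresponds to a sigmoid output > 0.5
    return ((model(Xt) > 0).float() == yt).float().mean().item()

X, y = make_xor_blobs()  # data sketch from above
for w1, w2 in [([[1, -1], [-1, 1]], [1, 1]),
               ([[1, 1], [1, 1]],   [1, -1]),
               ([[1, 1], [1, 1]],   [1, 1]),
               ([[1, -1], [-1, 1]], [1, -1])]:
    print(w1, w2, "-> training accuracy:", train_with_init(w1, w2, X, y))
```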

Chrigi

2 Answers


You have already grasped part of the answer yourself.

In the latter two cases the optimisation gets stuck in suboptimal local minima.

With only 2 hidden neurons the parameter space is low-dimensional, and such a suboptimal local minimum is easy to fall into. As you add neurons (and therefore dimensions to the parameter space), that minimum becomes harder and harder to get stuck in, because its likelihood decreases: intuitively, there are many more directions along which the optimiser can still improve than when you only had 2.

The problem still exists: even with 1000 neurons you could construct a specific set of weights that forms such a local minimum. It just becomes much less likely that a random initialisation leads you there.
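One way to see this empirically is to count how many random initialisations reach a good solution for different hidden widths. The sketch below assumes the settings described in the question; the blob layout, step count and the 95% accuracy threshold for "converged" are my own choices:

```python
import torch
import torch.nn as nn

def xor_data(n_per_blob=5000, spread=0.5, seed=0):
    # Assumed layout: four Gaussian blobs at the corners of a square, XOR labels.
    g = torch.Generator().manual_seed(seed)
    centers = torch.tensor([[-2., -2.], [-2., 2.], [2., -2.], [2., 2.]])
    X = torch.cat([c + spread * torch.randn(n_per_blob, 2, generator=g) for c in centers])
    y = torch.cat([torch.full((n_per_blob, 1), float((c[0] > 0) != (c[1] > 0)))
                   for c in centers])
    return X, y

def convergence_rate(hidden, X, y, trials=20, steps=500, acc_threshold=0.95):
    successes = 0
    for trial in range(trials):
        torch.manual_seed(trial)  # a different random initialisation per trial
        model = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        opt = torch.optim.Adam(model.parameters(), lr=0.1)
        loss_fn = nn.BCEWithLogitsLoss()  # sigmoid output folded into the loss
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(model(X), y).backward()
            opt.step()
        acc = ((model(X) > 0).float() == y).float().mean().item()
        successes += acc > acc_threshold
    return successes / trials

X, y = xor_data()
for hidden in (2, 3, 4, 8):
    print(hidden, "hidden units -> fraction of runs that converge:",
          convergence_rate(hidden, X, y))
```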

BlueMoon93

I may have scratched the surface of a much larger problem when I asked this question. In the meantime I have read the Lottery Ticket Hypothesis paper: https://arxiv.org/pdf/1803.03635.pdf

Basically, if you overparameterise your network, you are more likely to find a random initialisation that performs well: a winning ticket. The paper above shows that you can actually prune away the unneeded parts of the network after training. However, you need to overparameterise the network initially in order to increase the chance of randomly sampling a winning-ticket configuration.
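The paper's actual procedure is iterative magnitude pruning of individual weights combined with rewinding to the original initialisation. Purely as a toy illustration for the tiny network in my question, here is a sketch that keeps only the hidden units with the largest outgoing weights after training, assuming a trained model of the form nn.Sequential(nn.Linear(2, H), nn.ReLU(), nn.Linear(H, 1)):

```python
import torch
import torch.nn as nn

def prune_small_hidden_units(model, keep_fraction=0.5):
    # Rank hidden units by the magnitude of their outgoing weight and keep the strongest.
    w_out = model[2].weight.detach().abs().squeeze(0)  # one |weight| per hidden unit
    k = max(1, int(keep_fraction * w_out.numel()))
    keep = torch.topk(w_out, k).indices
    pruned = nn.Sequential(nn.Linear(2, k), nn.ReLU(), nn.Linear(k, 1))
    with torch.no_grad():
        pruned[0].weight.copy_(model[0].weight[keep])
        pruned[0].bias.copy_(model[0].bias[keep])
        pruned[2].weight.copy_(model[2].weight[:, keep])
        pruned[2].bias.copy_(model[2].bias)
    return pruned
```

This is not the rewinding experiment from the paper, just a quick way to check that a trained wide network often only needs a subset of its hidden units to represent the decision boundary.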

I believe the case in my question above is a minimal example of this.

Chrigi