I have read somewhere on the web (I lost the reference) that the number of units (or neurons) in a hidden layer should be a power of 2 because it helps the learning algorithm to converge faster.

Is this a fact? If it is, why is this true? Does it have something to do with how the memory is laid down?

nbro
dsfx3d

2 Answers

I have read somewhere on the web (I lost the reference) that the number of units (or neurons) in a hidden layer should be a power of 2 because it helps the learning algorithm to converge faster.

I would quite like to see a reference to this suggestion, in case it has been misunderstood.

As far as I know, there is no such effect in normal neural networks. In convolutional neural networks, it might be true in a minor way, because some FFT-based approaches work better with $2^n$ items.

Is this a fact? If it is, why is this true? Does it have something to do with how the memory is laid down?

I would say that this is not a general fact. Instead, it seems like misunderstood advice to search some hyperparameters, such as the number of neurons in each layer, by increasing or decreasing them by a factor of 2. Doing this and trying layer sizes of 32, 64, 128, etc. should find a good layer size faster than trying sizes of 32, 33, 34, etc.
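To make the search-cost argument concrete, here is a minimal sketch. The helper names `doubling_grid` and `linear_grid` and the range 32 to 1024 are assumptions chosen for illustration, not something from the original advice. It only counts how many candidate layer sizes each strategy would have to train and evaluate.

```python
def doubling_grid(low, high):
    """Candidate layer sizes from low up to high, doubling at each step."""
    sizes = []
    size = low
    while size <= high:
        sizes.append(size)
        size *= 2
    return sizes


def linear_grid(low, high):
    """Candidate layer sizes from low up to high, increasing by 1 each step."""
    return list(range(low, high + 1))


coarse = doubling_grid(32, 1024)  # hypothetical search range
fine = linear_grid(32, 1024)

print(coarse)                  # [32, 64, 128, 256, 512, 1024]
print(len(coarse), len(fine))  # 6 993
```

Each candidate means a full training run, so the doubling grid covers the same range with 6 runs instead of 993. The saving comes from the geometric spacing, not from the values being powers of 2: starting at 30 and doubling would work just as well.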

The main reason to pick powers of 2 is tradition in computer science. Provided there is no reason to pick other specific numbers, you may as well pick a power of 2... but equally, you will see researchers picking multiples of 10, 100 or 1000 as "round numbers", for a similar reason.

One related factor: if a researcher presents a result for some new technique where the hidden layer sizes were tuned to, e.g., 531, 779 and 282, then someone reviewing the work would ask the obvious question, "Why?" Such numbers might imply that the new technique is not generic or requires large amounts of hyperparameter tuning, neither of which would be seen as a positive trait. It is much better to be seen using an obvious "simple" number...

nbro
Neil Slater

There is a hardware-based reason. Matrix multiplication is one of the central computations in deep learning, and SIMD operations on CPUs process data in fixed-size batches, whose sizes are powers of 2.

Here is a good reference about speeding up neural networks on CPUs by leveraging SIMD instructions:

Improving the speed of neural networks on CPUs

In it, you will notice batch sizes that are powers of 2. It is a good paper to read if you want to implement neural networks using SIMD instructions.
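As a rough illustration of the SIMD point, the sketch below counts how many vector operations one pass over a layer would take and how many vector lanes are left unused in the last one. The function name `simd_chunks` and the default of 8 lanes (eight 32-bit floats in one 256-bit register) are assumptions for the example. It is a simplified model, not how any particular library schedules its work.

```python
def simd_chunks(n_units, lanes=8):
    """Return (vector_ops, unused_lanes) for one pass over n_units values.

    lanes is the assumed number of values processed per SIMD instruction.
    """
    full, remainder = divmod(n_units, lanes)
    vector_ops = full + (1 if remainder else 0)
    unused_lanes = (lanes - remainder) % lanes
    return vector_ops, unused_lanes


for n in (128, 130):
    print(n, simd_chunks(n))
# 128 (16, 0)
# 130 (17, 6)
```

Under this model, a layer of 128 units fills every vector exactly, while 130 units needs one extra operation that is mostly padding. Any size that is a multiple of the lane count fills the vectors equally well, and powers of 2 at or above the lane count are simply a convenient subset of those sizes.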

Hari