Why does every neuron in a hidden layer of a multi-layer perceptron (MLP) typically have the same activation function as every other neuron in the same or other hidden layers (I exclude the output layer, which typically has a different activation function)? Is this a requirement, are there advantages, or is it just a rule of thumb?
1 Answer
As you stated, it's popular to have some form of rectified linear unit (ReLU) activation in the hidden layers, while the output layer is often a softmax or sigmoid (depending on the problem: multi-class or binary classification, respectively), which produces an output that can be interpreted as a probability distribution.
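As a minimal sketch of that conventional setup (assuming PyTorch; the layer sizes are arbitrary), every hidden layer shares the same ReLU activation, and only the output activation is chosen by the task:

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim=20, hidden_dim=64, n_classes=3):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),      # hidden layer 1: ReLU
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),  # hidden layer 2: ReLU
        )
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):
        # Softmax turns the logits into a distribution over classes;
        # for binary classification a single sigmoid unit would be used instead.
        return torch.softmax(self.out(self.hidden(x)), dim=-1)

probs = MLP()(torch.randn(5, 20))  # shape (5, 3); each row sums to 1
```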
You could generalize this further to blocks of different activation functions within the same layer. This is something I've thought about but haven't done, though I imagine it has been attempted. In some sense, the idea would be to allow a subsection of the network to develop a representation that may not be feasible otherwise. These different representations within the same layer would then be unified by subsequent layers as we move closer to the output (see the sketch below).
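Here is one hypothetical way such a "mixed-activation" layer could look, again assuming PyTorch; the 50/50 split into a ReLU block and a tanh block is purely an illustrative assumption, not an established recipe:

```python
import torch
import torch.nn as nn

class MixedActivationLayer(nn.Module):
    """A single layer whose units are split into blocks with different activations."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.split = out_dim // 2  # first block uses ReLU, second block uses tanh

    def forward(self, x):
        z = self.linear(x)
        return torch.cat(
            [torch.relu(z[..., :self.split]),   # block 1: ReLU units
             torch.tanh(z[..., self.split:])],  # block 2: tanh units
            dim=-1,
        )

# Subsequent (ordinary) layers then combine the two representations.
net = nn.Sequential(MixedActivationLayer(20, 64), nn.ReLU(), nn.Linear(64, 10))
y = net(torch.randn(5, 20))  # shape (5, 10)
```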