
Why do non-linear activation functions that produce values larger than 1 or smaller than 0 work?

My understanding is that neurons can only produce values between 0 and 1, and that this assumption can be used in things like cross-entropy. Are my assumptions just completely wrong?

Is there any reference that explains this?


2 Answers


Why wouldn't they work?

Each neuron's output is a function applied to the weighted sum of its inputs (the outputs of the neurons feeding into it). If that function is the sigmoid, the output is squashed into $[0,1]$. If the entire layer uses a softmax, the outputs of all its neurons lie in $[0,1]$ and sum to 1. In other words, they represent a probability distribution, which you can then optimize with cross-entropy (cross-entropy measures the difference between two probability distributions).
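As a quick illustration, here is a minimal NumPy sketch (the toy logits and target are made up) showing that the sigmoid squashes each value into $(0,1)$, softmax turns a vector into a probability distribution, and cross-entropy then compares that distribution to the target:

```python
import numpy as np

def sigmoid(z):
    # Squashes each value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Squashes a vector into (0, 1), with the entries summing to 1.
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(p_true, p_pred):
    # Difference between two probability distributions (lower is better).
    return -np.sum(p_true * np.log(p_pred + 1e-12))

logits = np.array([2.0, -1.0, 0.5])          # raw pre-activation values
print(sigmoid(logits))                        # each entry in (0, 1)
probs = softmax(logits)
print(probs, probs.sum())                     # entries in (0, 1), sum == 1.0
print(cross_entropy(np.array([1.0, 0.0, 0.0]), probs))
```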

ReLU and ELU are simply other activation functions, whose output is not limited to the range $[0, 1]$. They are differentiable (ReLU everywhere except at $0$, where a subgradient is used in practice), so gradients can still flow through them and they can be used in any neural network.
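To make the "not limited to $[0, 1]$" point concrete, here is a small sketch of ReLU and ELU applied to some made-up pre-activations (the input values are only for illustration):

```python
import numpy as np

def relu(z):
    # Unbounded above: outputs lie in [0, +inf).
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    # Outputs lie in (-alpha, +inf): they can be negative and larger than 1.
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-3.0, -0.5, 0.0, 2.5, 10.0])
print(relu(z))   # [ 0.    0.    0.    2.5  10.  ]
print(elu(z))    # [-0.95 -0.39  0.    2.5  10.  ]
```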

BlueMoon93

Christopher Olah's blog post explains it better than I ever could. Basically, most data we come across can't be separated with a single line, but only with some kind of curve. Non-linearities allow the network to distort the input space in ways that make the data linearly separable, which makes classification more accurate.
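A tiny example of this idea, with hand-picked weights chosen purely for illustration (a trained network would learn its own): XOR cannot be separated by a single line in the original 2D space, but after a ReLU hidden layer it becomes linearly separable.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# XOR inputs and labels: not linearly separable in the original 2D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Hand-picked hidden layer: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1).
H = np.stack([relu(X.sum(axis=1)), relu(X.sum(axis=1) - 1.0)], axis=1)

# In the hidden space a single linear readout separates the classes:
# out = h1 - 2 * h2 reproduces the XOR labels exactly.
out = H[:, 0] - 2.0 * H[:, 1]
print(H)     # hidden representations
print(out)   # [0. 1. 1. 0.]  -> matches y
```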

Daniel