
Why do non-linear activation functions that produce values larger than 1 or smaller than 0 work?

My understanding is that neurons can only produce values between 0 and 1, and that this assumption can be used in things like cross-entropy. Are my assumptions just completely wrong?

Is there any reference that explains this?


2 Answers


Why wouldn't they work?

Each neuron's output is a function applied to the weighted sum of its inputs (the outputs of the neurons feeding into it). If that function is the sigmoid, the output is squashed into $[0,1]$. If the entire layer uses a softmax, the outputs of all its neurons lie in $[0,1]$ and sum to 1. In other words, they represent a probability distribution, which you can then optimize with cross-entropy (cross-entropy measures the difference between two probability distributions).
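As a quick illustration, here is a minimal NumPy sketch (the toy logits and target are made up) showing that the sigmoid squashes each value into $(0,1)$, softmax turns a vector into a probability distribution, and cross-entropy then compares that distribution to the target:

```python
import numpy as np

def sigmoid(z):
    # Squashes each value into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Squashes a vector into (0, 1), with the entries summing to 1.
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(p_true, p_pred):
    # Difference between two probability distributions (lower is better).
    return -np.sum(p_true * np.log(p_pred + 1e-12))

logits = np.array([2.0, -1.0, 0.5])          # raw pre-activation values
print(sigmoid(logits))                        # each entry in (0, 1)
probs = softmax(logits)
print(probs, probs.sum())                     # entries in (0, 1), sum == 1.0
print(cross_entropy(np.array([1.0, 0.0, 0.0]), probs))
```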

ReLU and ELU are simply other activation functions, whose output is not limited to the range $[0, 1]$. They are differentiable (ReLU everywhere except at $0$, where a subgradient is used in practice), so gradients can still flow through them and they can be used in any neural network.
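To make the "not limited to $[0, 1]$" point concrete, here is a small sketch of ReLU and ELU applied to some made-up pre-activations (the input values are only for illustration):

```python
import numpy as np

def relu(z):
    # Unbounded above: outputs lie in [0, +inf).
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    # Outputs lie in (-alpha, +inf): they can be negative and larger than 1.
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.array([-3.0, -0.5, 0.0, 2.5, 10.0])
print(relu(z))   # [ 0.    0.    0.    2.5  10.  ]
print(elu(z))    # [-0.95 -0.39  0.    2.5  10.  ]
```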

BlueMoon93

Christopher Olah's blog post explains it better than I ever could. Basically, most data we come across can't be separated with a single line, but only with some kind of curve. Non-linearities allow the network to distort the input space in ways that make the data linearly separable, which makes classification more accurate.
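A tiny example of this idea, with hand-picked weights chosen purely for illustration (a trained network would learn its own): XOR cannot be separated by a single line in the original 2D space, but after a ReLU hidden layer it becomes linearly separable.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# XOR inputs and labels: not linearly separable in the original 2D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Hand-picked hidden layer: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1).
H = np.stack([relu(X.sum(axis=1)), relu(X.sum(axis=1) - 1.0)], axis=1)

# In the hidden space a single linear readout separates the classes:
# out = h1 - 2 * h2 reproduces the XOR labels exactly.
out = H[:, 0] - 2.0 * H[:, 1]
print(H)     # hidden representations
print(out)   # [0. 1. 1. 0.]  -> matches y
```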

Daniel