
Why do we use the softmax activation function on the last layer?

Suppose $i$ is the index of the highest value in the output vector when we don't use softmax at all. If we apply softmax, the $i$th value is still the highest, because $e^x$ is an increasing function, so argmax(vec) and argmax(softmax(vec)) give the same index. That's why I am asking this question.
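
For example, a minimal NumPy sketch of this observation (the logit vector is made up):

```python
import numpy as np

vec = np.array([2.0, -1.0, 0.5, 3.0])   # made-up logits
sm = np.exp(vec) / np.exp(vec).sum()     # softmax of vec

print(np.argmax(vec))   # 3
print(np.argmax(sm))    # 3 -- same index, since exp() preserves ordering
```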

nbro
dato nefaridze

1 Answer


Short answer: Generally, you don't need to apply softmax if you don't need probabilities, and using raw logits leads to more numerically stable code.

Long answer: First of all, the inputs of the softmax layer are called logits.

During evaluation, if you are only interested in the highest-probability class, you can simply take argmax(vec) of the logits. If you want a probability distribution over the classes, then you need to exponentiate the logits and normalize them so they sum to 1 - that is exactly what softmax does.
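
A minimal NumPy sketch of that exponentiate-and-normalize step (the logit values are made up):

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit before exponentiating to avoid overflow;
    # this doesn't change the result because softmax is shift-invariant.
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

print(probs)                                   # ~[0.659 0.242 0.099]
print(probs.sum())                             # 1.0
print(np.argmax(logits) == np.argmax(probs))   # True
```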

During training, you need a loss function to optimize. Your training data contains the true classes, so you have a target probability distribution $p_i$, which is 1 at the true class and 0 at all other classes. You train the network to produce a probability distribution $q_i$ as output, which should be as close to the target distribution $p_i$ as possible. The standard "distance" measure between two probability distributions is the cross-entropy:

$$ H = - \sum_i p_i \log q_i $$

As you can see, you only need the logarithms of the output probabilities, and those can be computed directly from the logits (via log-softmax) without ever materializing the probabilities themselves - so the logits suffice to compute the loss. For example, the standard Keras CategoricalCrossentropy loss can be configured with from_logits=True, and its documentation mentions that:

Using from_logits=True is more numerically stable.
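
A minimal sketch of the two options in Keras (the logits and targets below are made up for illustration): computing the loss from explicit probabilities versus directly from the logits with from_logits=True gives essentially the same value, but the latter skips the softmax/log round trip.

```python
import numpy as np
import tensorflow as tf

# Made-up logits for a batch of 2 examples with 3 classes, and one-hot targets.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, -1.0]], dtype=np.float32)
targets = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]], dtype=np.float32)

# Option 1: apply softmax yourself and compute the loss from probabilities.
probs = tf.nn.softmax(logits)
loss_from_probs = tf.keras.losses.CategoricalCrossentropy()(targets, probs)

# Option 2: let the loss work on the raw logits (more numerically stable).
loss_from_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True)(targets, logits)

print(float(loss_from_probs), float(loss_from_logits))  # nearly identical values
```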

Kostya