
The beginner Colab example for TensorFlow states:

Note: It is possible to bake this tf.nn.softmax in as the activation function for the last layer of the network. While this can make the model output more directly interpretable, this approach is discouraged as it's impossible to provide an exact and numerically stable loss calculation for all models when using a softmax output.

My question is, then, why? What do they mean by impossible to provide an exact and numerically stable loss calculation?
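For reference, this is roughly the contrast the note is drawing, sketched in Keras (the layer sizes and input shape here are placeholders, not necessarily the tutorial's exact model):

```python
import tensorflow as tf

# Discouraged: softmax baked into the last layer, loss computed on probabilities.
model_probs = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model_probs.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
)

# What the tutorial does instead: the last layer outputs raw logits, and the
# loss fuses softmax and cross-entropy internally; tf.nn.softmax can still be
# applied at inference time to get probabilities.
model_logits = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10),
])
model_logits.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```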

galah92

2 Answers


It's because of gradient computation: automatic differentiation computes the gradient for each module separately. If you have a standalone cross-entropy module, its gradient behaves like $1/x$, so it diverges for small input values and makes the overall loss numerically unstable. If you instead use a fused softmax + cross-entropy module, the gradient simplifies to a difference of probabilities, $(y - p)$, which is bounded and numerically stable.

Slides from Simon Osindero's (DeepMind) lecture at UCL in 2016:

[Slide: CE (cross-entropy module alone)]

[Slide: CE + Softmax (fused module)]
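A small NumPy sketch of the same contrast (the class count, target, and logit values are made up for illustration):

```python
import numpy as np

# Toy 3-class example: one-hot target p and very confident (wrong) logits z
# that put almost no mass on the true class.
p = np.array([0.0, 0.0, 1.0])      # target distribution
z = np.array([30.0, 0.0, -30.0])   # logits

s = np.exp(z - z.max())
s /= s.sum()                       # softmax output; s[2] is ~9e-27

# Standalone cross-entropy module: gradient w.r.t. its input s is -p_i / s_i,
# which blows up as the predicted probability of the true class goes to 0.
grad_ce_only = -p / s
print(grad_ce_only)                # [-0., -0., ~-1e26]

# Fused softmax + cross-entropy: gradient w.r.t. the logits is s - p,
# which is always bounded in [-1, 1].
grad_fused = s - p
print(grad_fused)                  # approximately [1.0, 9.4e-14, -1.0]
```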

ted

This is also a question I stumbled upon; thanks to ted for the explanation, it is very helpful. I will try to elaborate a little bit. Let's still use Simon Osindero's slide: the grey block on the left is only a cross-entropy operation, the input $x$ (a vector) could be the softmax output of the previous layer (not the input of the neural network), and $y$ (a scalar) is the cross-entropy result of $x$. To propagate the gradient back, we need to calculate the gradient $dy/dx_i$, which is $-p_i/x_i$ for each element of $x$. As we know, the softmax function scales the logits into the range $[0,1]$, so if in one training step the neural network becomes super confident and predicts one of the probabilities $x_i$ to be 0, then we have a numerical problem in calculating $dy/dx_i$.
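Writing that step out (a short sketch in the same notation, with $p$ the target distribution):

$$y = -\sum_j p_j \log x_j \quad\Longrightarrow\quad \frac{\partial y}{\partial x_i} = -\frac{p_i}{x_i},$$

which grows without bound as $x_i \to 0$.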

In the other case, where we take the logits and compute the softmax and cross-entropy in one shot (the XentLogits function), we don't have this problem, because the derivative of XentLogits with respect to the logits is $dy/dx_i = \operatorname{softmax}(x)_i - p_i$; a more elaborate derivation can be found here.
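For completeness, a short sketch of that derivative, writing $s_i = \operatorname{softmax}(x)_i = e^{x_i}/\sum_k e^{x_k}$ and using $\sum_j p_j = 1$:

$$y = -\sum_j p_j \log s_j = -\sum_j p_j x_j + \log\sum_k e^{x_k} \quad\Longrightarrow\quad \frac{\partial y}{\partial x_i} = s_i - p_i,$$

which always stays in $[-1, 1]$, no matter how confident the network gets.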

xeonqq