3

I am in the process of writing my own basic machine learning library in Python as an exercise to gain a good conceptual understanding. I have successfully implemented backpropagation for activation functions such as $\tanh$ and the sigmoid function. However, those have bounded outputs. A function like ReLU is unbounded, so its outputs can blow up quickly. In my understanding, a classification layer, usually using the SoftMax function, is added at the end to squash the outputs between 0 and 1.

How does backpropagation work with this? Do I just treat the SoftMax function as another activation function and compute its gradient? If so, what is that gradient and how would I implement it? If not, how does the training process work? If possible, a pseudocode answer is preferred.

1 Answer

3

Backprop through ReLU is easier than backprop through sigmoid activations. Where the input to the ReLU is positive, you pass the incoming gradient through unchanged; where it is negative, you set the gradient to 0.
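A minimal NumPy sketch of that rule (the function names and the cache-based interface are my own, not anything your library has to follow):

```python
import numpy as np

def relu_forward(x):
    # Keep the input around so the backward pass knows which entries were positive.
    cache = x
    return np.maximum(0.0, x), cache

def relu_backward(grad_out, cache):
    # Pass the gradient through where the input was positive, zero it elsewhere.
    return grad_out * (cache > 0)
```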

Regarding softmax, the easiest approach is to treat it as part of the negative log-likelihood loss. In other words, derive the gradient of that loss directly with respect to the softmax input, rather than backpropagating through the softmax in isolation. The result is very elegant and extremely easy to implement. Try to derive it yourself!
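If you want something to check your derivation against: for a one-hot target, the gradient of the negative log-likelihood with respect to the softmax inputs (the logits) is simply the softmax output minus the one-hot target. A rough NumPy sketch of the fused forward/backward step (function names and the integer-label convention are my own choices):

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max before exponentiating for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_nll_loss_and_grad(logits, targets):
    # logits: (batch, classes); targets: integer class labels of shape (batch,)
    n = logits.shape[0]
    probs = softmax(logits)
    # Negative log-likelihood of the correct class, averaged over the batch.
    loss = -np.log(probs[np.arange(n), targets]).mean()
    # Gradient w.r.t. the logits: softmax output minus the one-hot target.
    grad = probs.copy()
    grad[np.arange(n), targets] -= 1.0
    grad /= n
    return loss, grad
```

The `grad` returned here is then backpropagated through the rest of the network like any other upstream gradient, so you never need a separate softmax derivative.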

ssegvic