
Backpropagation is based on partial derivatives, including those of the activation functions.

How is back propagation applied when the activation function is not differentiable?

DSPinfinity

1 Answer


How is back propagation applied when the activation function is not differentiable?

It isn't. You cannot establish gradient values through non-differentiable functions.

However, this isn't a common reason to reject an activation function. When an activation function is not differentiable at only one or two specific points, that isn't a problem in practice. ReLU, for example, is not differentiable at 0, yet it is often a good choice. The issue at 0 can be side-stepped by assigning an arbitrary gradient value there (typically 0 or 1).
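As a rough sketch (in NumPy, not any particular framework's exact implementation), a ReLU backward pass might look like this, with the value at exactly 0 chosen arbitrarily:

```python
import numpy as np

def relu(x):
    """ReLU forward pass."""
    return np.maximum(x, 0.0)

def relu_grad(x, grad_at_zero=0.0):
    """Derivative of ReLU used in backpropagation.

    The derivative is 0 for x < 0 and 1 for x > 0; at exactly x == 0 it is
    undefined, so we simply assign an arbitrary value (0 here; 1 is also common).
    """
    g = (x > 0).astype(float)
    g[x == 0] = grad_at_zero
    return g

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(relu_grad(x))  # [0. 0. 1.]
```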

If someone tells you the step function cannot be used in backpropagation because it is "non-differentiable", that is not strictly true. It is differentiable on the same domain as ReLU (everywhere except 0). The real problem with the step function is that it does not have a useful gradient: it is zero everywhere it is defined, so no learning signal flows back through it. Many activation functions are problematic because their gradients, despite being calculable, are not well-behaved in other ways: very large gradients cause instability, and regions of very small or zero gradient mean training cannot progress.
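To see why the zero gradient is the real obstacle, here is a small illustrative example (the single-neuron setup is just an assumption for demonstration): with a step activation, the chain rule multiplies the upstream gradient by zero, so the weight update is always zero.

```python
import numpy as np

def step(x):
    """Heaviside step activation."""
    return (x > 0).astype(float)

def step_grad(x):
    """'Derivative' of the step function: zero everywhere it is defined
    (we assign 0 at x == 0 as well, mirroring the ReLU convention)."""
    return np.zeros_like(x)

# A single neuron: y = step(w * x). The chain rule gives
# dL/dw = dL/dy * step_grad(w * x) * x, which is always 0,
# so gradient descent never moves w, whatever the loss is.
x, w = 1.5, 0.3
upstream = 1.0  # pretend dL/dy = 1
dw = upstream * step_grad(np.array([w * x]))[0] * x
print(dw)  # 0.0 -> no learning signal
```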

If you want to train a neural network with non-differentiable activation functions, or functions that would be unstable using gradient approaches, then you cannot use backpropagation. However, you can search for optimal neural network weights in other ways, such as genetic algorithms. These search methods are typically less efficient than gradient-based approaches, but might be worth it if there is some other good reason for picking a "difficult" activation function.
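For example, here is a minimal sketch of a (1+1) evolutionary search over the weights of a toy step-activation network. The 2-4-1 architecture and the XOR task are purely illustrative assumptions, not a recommended setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x):
    return (x > 0).astype(float)

def forward(weights, X):
    """Toy 2-4-1 network with step activations."""
    W1, b1, W2, b2 = weights
    h = step(X @ W1 + b1)
    return step(h @ W2 + b2)

def loss(weights, X, y):
    return np.mean((forward(weights, X) - y) ** 2)

def random_weights():
    return [rng.normal(size=(2, 4)), rng.normal(size=4),
            rng.normal(size=(4, 1)), rng.normal(size=1)]

def mutate(weights, scale=0.1):
    """Add small Gaussian noise to every weight matrix."""
    return [w + rng.normal(scale=scale, size=w.shape) for w in weights]

# XOR: a classic task that needs the hidden layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# (1+1) evolutionary search: keep the best candidate, mutate it, keep
# the mutant only if it does at least as well.
best = random_weights()
best_loss = loss(best, X, y)
for _ in range(5000):
    cand = mutate(best)
    cand_loss = loss(cand, X, y)
    if cand_loss <= best_loss:
        best, best_loss = cand, cand_loss

print(best_loss)
```

This is far less sample-efficient than gradient descent on a differentiable network, which is the trade-off described above.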

Neil Slater