
While looking into the vanishing gradient problem, I came across a paper (https://ieeexplore.ieee.org/abstract/document/9336631) that used artificial derivatives in lieu of the real derivatives. For a visualization, see the attached image:

As you can see, the artificial derivative of the ReLU is the sigmoid function, which smooths out the derivative to circumvent the dying ReLU problem. The authors apply similar smoothing operations to other activation functions such as the sigmoid. This idea seems pretty convincing, and it surprises me that it has not been used thus far, especially considering that, as the authors claim, replacing the derivatives incurs no additional computational cost. So why is it that vanishing gradients have not been addressed in this way in conventional deep learning frameworks such as PyTorch or TensorFlow?

1 Answer


This idea seems pretty convincing

Indeed, you don't have to use the exact gradient of the activation function during the backward step.

The gradient of the activation function is a multiplicative factor in the gradient of the loss. However, the exact gradient of the loss is almost never used: it is not used in stochastic gradient descent, mini-batch training, training with momentum, Adam, Newton's method, or most other methods.
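To make the multiplicative structure explicit, in generic chain-rule notation (my notation, not the paper's): for a unit with weight $w$, pre-activation $z$, activation $a = \phi(z)$ and loss $L$,

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a}\,\phi'(z)\,\frac{\partial z}{\partial w},$$

and the paper's idea amounts to substituting a surrogate for the factor $\phi'(z)$ during backpropagation, e.g. the sigmoid in place of the ReLU's step-function derivative.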

As only an approximate gradient of the loss is used in practice, we don't have to keep any part of it exact, so there is no need to keep the gradient of the activation function exact either. It can be replaced by an approximation, and any approximation is fine (including the one from the paper you referenced) as long as the loss decreases during training.

So why is it that vanishing gradients have not been addressed in this way in conventional deep learning frameworks such as PyTorch or TensorFlow?

This can be done in PyTorch and TensorFlow. You can override the gradient by defining a custom backward() method in PyTorch (via torch.autograd.Function) or a grad() function in TensorFlow (via tf.custom_gradient):

Extending PyTorch with Custom Activation Functions
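For concreteness, here is a minimal PyTorch sketch, assuming the paper's ReLU example where sigmoid(x) stands in for the true step-function derivative (the class name SurrogateReLU is mine, not the authors'):

import torch

class SurrogateReLU(torch.autograd.Function):
    # Forward pass is the ordinary ReLU; the backward pass uses
    # sigmoid(x) as an "artificial" derivative instead of (x > 0).

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)                    # standard ReLU output

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * torch.sigmoid(x)    # surrogate derivative

# usage: drop-in replacement for torch.relu inside a model
x = torch.randn(8, requires_grad=True)
y = SurrogateReLU.apply(x).sum()
y.backward()
print(x.grad)                                    # equals sigmoid(x), not the 0/1 step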

tf.custom_gradient
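And roughly the same thing with tf.custom_gradient (again, surrogate_relu and the sigmoid surrogate are illustrative choices, not code from the paper):

import tensorflow as tf

@tf.custom_gradient
def surrogate_relu(x):
    y = tf.nn.relu(x)                       # ordinary ReLU in the forward pass
    def grad(upstream):
        # sigmoid(x) replaces the true derivative, (x > 0) cast to float
        return upstream * tf.sigmoid(x)
    return y, grad

x = tf.constant([-2.0, -0.5, 0.0, 1.5])
with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.reduce_sum(surrogate_relu(x))
print(tape.gradient(y, x))                  # sigmoid(x), not the hard 0/1 mask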