Questions tagged [vanishing-gradient-problem]

For questions related to the vanishing gradient problem, which is a numerical problem that occurs while training a (deep) neural network with a gradient-based optimization technique. There's also the related exploding gradient problem.
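As a rough illustration of the numerical effect described above, here is a minimal sketch, assuming a chain of scalar sigmoid layers that all share one weight and one pre-activation value. The helper names `sigmoid_grad` and `backprop_factor` are made up for this example and do not come from any library. Since the sigmoid derivative is at most 0.25, the product of per-layer factors shrinks quickly with depth unless the weights are large, in which case it can grow instead (the exploding case).

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid; its maximum value is 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def backprop_factor(depth: int, pre_activation: float = 0.0, weight: float = 1.0) -> float:
    """Magnitude of the factor that backpropagation accumulates through
    `depth` scalar sigmoid layers (hypothetical setup: every layer has the
    same weight and the same pre-activation)."""
    factor = 1.0
    for _ in range(depth):
        # Chain rule: each layer contributes |weight| * sigmoid'(pre_activation).
        factor *= abs(weight) * sigmoid_grad(pre_activation)
    return factor


if __name__ == "__main__":
    for depth in (1, 5, 10, 20):
        print(depth, backprop_factor(depth), backprop_factor(depth, weight=5.0))
```

With the default weight of 1 the factor shrinks geometrically with depth (vanishing), while a weight of 5 makes each per-layer factor 1.25, so the product grows (exploding). Real networks use matrices rather than scalars, so this only shows the trend.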

23 questions
7
votes
1 answer

Why do ResNets avoid the vanishing gradient problem?

I read that, if we use the sigmoid or hyperbolic tangent activation functions in deep neural networks, we can run into vanishing gradients, and that this is visible in the shape of the derivatives of these functions. ReLU solves…
6
votes
1 answer

If vanishing gradients are NOT the problem that ResNets solve, then what is the explanation behind ResNet success?

I often see blog posts or questions on here starting with the premise that ResNets solve the vanishing gradient problem. The original 2015 paper contains the following passage in section 4.1: We argue that this optimization difficulty is unlikely…
5
votes
2 answers

What are the common pitfalls that we could face when training neural networks?

Apart from the vanishing or exploding gradient problems, what are other problems or pitfalls that we could face when training neural networks?
5
votes
1 answer

What effect does batch norm have on the gradient?

Batch norm is a technique that essentially standardizes the activations at each layer before passing them on to the next layer. Naturally, this will affect the gradient through the network. I have seen the equations that derive the…
5
votes
1 answer

How to detect vanishing gradients?

Can vanishing gradients be detected by the change in distribution (or lack thereof) of my convolution's kernel weights throughout the training epochs? And if so, how? For example, if only 25% of my kernel's weights ever change throughout the epochs,…
4
votes
0 answers

Why does sigmoid saturation prevent signal flow through the neuron?

As per these slides on page 35: Sigmoids saturate and kill gradients. When the neuron's activation saturates at either tail of 0 or 1, the gradient at these regions is almost zero. the gradient and almost no signal will flow through the neuron…
3
votes
3 answers

Why is the vanishing gradient problem especially relevant for an RNN and not an MLP?

I would like to know why the vanishing gradient problem is especially relevant for an RNN and not an MLP (multi-layer perceptron). In an MLP you also backpropagate errors and multiply different weights. If the weights are small, the resulting update in the…
3
votes
1 answer

Why aren't artificial derivatives used more often to solve the vanishing gradient problem?

While looking into the vanishing gradient problem, I came across a paper (https://ieeexplore.ieee.org/abstract/document/9336631) that used artificial derivatives in lieu of the real derivatives. For a visualization, see the attached image: As you…
3
votes
0 answers

Would a different learning rate for every neuron and layer mitigate or solve the vanishing gradient problem?

I'm interested in using the sigmoid (or tanh) activation function instead of ReLU. I'm aware of ReLU's advantages: faster computation and no vanishing gradient problem. But regarding vanishing gradients, the main problem is about the backpropagation…
3
votes
3 answers

How do LSTM and GRU overcome the vanishing gradient problem?

I'm watching the video Recurrent Neural Networks (RNN) | RNN LSTM | Deep Learning Tutorial | Tensorflow Tutorial | Edureka where the author says that the LSTM and GRU architectures help to reduce the vanishing gradient problem. How do LSTM and GRU…
2
votes
1 answer

Why are all the gradient values 0 except for the first iteration?

I am fine-tuning a mistral-7b with Hugging Face peft and quantization. In my training loop, I am printing the gradient values for each batch, which seem a bit unusual. # Print gradients for name, param in model_init.named_parameters(): if…
2
votes
1 answer

Can attention models be replaced by non-sigmoid activation functions?

As far as I understand, the attention model in an LLM is used to mitigate the vanishing gradient problem. When using activation functions like the sigmoid function, deep neural networks may lead to gradients that are very close to zero (because you…
2
votes
1 answer

How does the vanishing gradient prevent RNNs from working for long-range dependencies?

I am really trying to understand deep learning models like RNNs, LSTMs, etc. I have gone through many tutorials on RNNs and have learned that an RNN cannot work for long-range dependencies, like: Consider trying to predict the last word in the text “I…
2
votes
0 answers

How to decide if gradients are vanishing?

I am trying to debug a convolutional neural network. I am seeing gradients close to zero. How can I decide whether these gradients are vanishing or not? Is there some threshold to decide on vanishing gradient by looking at the values? I am getting…
1
vote
1 answer

Might use of rational numbers and calculations be beneficial for an ANN?

Rational numbers would help alleviate some gradient issues by not losing precision as the weights and the propagated values (signal) reach extremely low and high values. I'm not aware of any hardware that is optimized for rationals. GPUs are all…