I read that, if we use the sigmoid or hyperbolic tangent activation functions in deep neural networks, we can have problems with the vanishing of the gradient, and this is visible from the shapes of the derivatives of these functions. ReLU solves this problem thanks to its derivative, even if there may be some dead units. ResNet uses ReLU as its activation function, but, looking online, what I understood is that ResNet solves the vanishing of the gradient thanks to its identity map, and I do not totally agree with that. So what's the purpose of the identity connections in ResNet? Are they used for solving the vanishing of the gradient? And does ReLU really solve the vanishing of the gradient in deep neural networks?
1 Answer
Before proceeding, it's important to note that ResNets, as pointed out here, were not introduced specifically to solve the VGP, but to improve learning in general. In fact, the authors of ResNet, in the original paper, noticed that neural networks without residual connections don't learn as well as ResNets, even though they use batch normalization, which, in theory, ensures that gradients should not vanish (section 4.1). So, in this answer, I'm just giving a potential explanation of why ResNets may also mitigate (or prevent to some extent) the VGP, although the research papers cited below also support the claim that ResNets prevent the VGP. Given that I didn't fully read all the papers mentioned in this answer, the information in this answer may not be fully accurate.
The skip connections allow information to skip layers, so, in the forward pass, information from layer $l$ can be fed directly into layer $l+t$ (i.e. the activations of layer $l$ are added to the activations of layer $l+t$), for $t \geq 2$, and, during the backward pass, the gradients can also flow unchanged from layer $l+t$ to layer $l$.
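To make this a bit more concrete, here is a short derivation. The notation is my own assumption and not taken from the papers: $a_l$ denotes the activations of layer $l$, $F$ denotes the stack of layers that is skipped, and $\mathcal{L}$ is the loss. I'm also ignoring the ReLU that the original ResNet block applies after the addition, so this is a simplified sketch.

$$a_{l+t} = a_l + F(a_l)$$

$$\frac{\partial \mathcal{L}}{\partial a_l} = \frac{\partial \mathcal{L}}{\partial a_{l+t}} \left( I + \frac{\partial F}{\partial a_l} \right) = \frac{\partial \mathcal{L}}{\partial a_{l+t}} + \frac{\partial \mathcal{L}}{\partial a_{l+t}} \frac{\partial F}{\partial a_l}$$

The first term is the gradient at layer $l+t$ copied unchanged through the identity path. So, even if the second term (the one that goes through the skipped layers) becomes very small, the gradient that reaches layer $l$ does not have to become small with it.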
How exactly could this prevent the vanishing gradient problem (VGP)? The VGP occurs when the elements of the gradient (the partial derivatives with respect to the parameters of the NN) become exponentially small, so that the update of the parameters with the gradient becomes almost insignificant (i.e. if you add a very small number $0 < \epsilon \ll 1$ to another number $d$, $d+\epsilon$ is almost the same as $d$) and, consequently, the NN learns very slowly or not at all (considering also numerical errors).
Given that these partial derivatives are computed with the chain rule, this can easily occur, because you keep on multiplying small (finite-precision) numbers (please, have a look at how the chain rule works, if you're not familiar with it). For example, $\frac{1}{5}\frac{1}{5} = \frac{1}{25}$ and then $\frac{1}{5}\frac{1}{25} = \frac{1}{125}$, and so on. The deeper the NN, the more likely the VGP is to occur. This should be quite intuitive if you are familiar with the chain rule and the back-propagation algorithm.
By allowing information to skip layers, layer $l+t$ receives information from both layer $l+t-1$ and layer $l$ (unchanged, i.e. you do not perform multiplications). For example, to compute the activations of layer $l+t-1$, you perform the usual linear combination followed by the application of the non-linear activation function (e.g. ReLU). In this linear combination, you perform multiplications between numbers that could already be quite small, so the results of these multiplications are even smaller numbers. If you use saturating activation functions (e.g. tanh), this problem can be aggravated. If the activations of layer $l+t$ are even smaller than the activations of layer $l+t-1$, the addition of the information from layer $l$ will make these activations bigger and so, to some extent, it will prevent them from becoming exponentially small. A similar thing can be said for the back-propagation of the gradient.
Therefore, skip connections can mitigate the VGP, and so they can be used to train deeper NNs.
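The argument above can also be checked numerically with a toy experiment. This is only a minimal sketch under my own assumptions, not the setup used in any of the papers: the function name `input_gradient_norm`, the width of 32, the tanh activation, the weight scale and the skip over a single layer (rather than $t \geq 2$ layers) are all illustrative choices. It back-propagates by hand through a stack of layers and reports the norm of the gradient that reaches the input, with and without the identity path.

```python
import numpy as np


def input_gradient_norm(depth, width=32, residual=False, seed=0):
    """Norm of d(sum of outputs)/d(input) for a toy tanh network.

    Hypothetical helper for illustration only; each layer computes
    tanh(W h), plus h itself when residual=True.
    """
    rng = np.random.default_rng(seed)
    # Deliberately small weights, so that every layer shrinks the signal.
    scale = 0.5 / np.sqrt(width)
    weights = [rng.normal(0.0, scale, size=(width, width)) for _ in range(depth)]
    h = rng.normal(size=width)

    # Forward pass: keep the pre-activations, they are needed for tanh'.
    pre_activations = []
    for W in weights:
        z = W @ h
        pre_activations.append(z)
        h = h + np.tanh(z) if residual else np.tanh(z)

    # Backward pass (chain rule), starting from dL/dh = 1 for L = sum(h).
    grad = np.ones(width)
    for W, z in zip(reversed(weights), reversed(pre_activations)):
        through_layer = W.T @ (grad * (1.0 - np.tanh(z) ** 2))
        # With a skip connection, the incoming gradient is also passed on as is.
        grad = grad + through_layer if residual else through_layer
    return float(np.linalg.norm(grad))


if __name__ == "__main__":
    for depth in (5, 20, 50):
        plain = input_gradient_norm(depth, residual=False)
        skip = input_gradient_norm(depth, residual=True)
        print(f"depth={depth:3d}  plain={plain:.3e}  residual={skip:.3e}")
```

In the plain network, the gradient is multiplied by one more small Jacobian at every layer, so its norm should shrink roughly geometrically with the depth, while in the residual version the incoming gradient is added back at every layer, so it cannot shrink in the same way. Of course, this toy setting is chosen to make the effect visible; it doesn't prove anything about real ResNets.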
These explanations are roughly consistent with the findings reported in the paper Residual Networks Behave Like Ensembles of Relatively Shallow Networks, which states:
Our results reveal one of the key characteristics that seem to enable the training of very deep networks: Residual networks avoid the vanishing gradient problem by introducing short paths which can carry gradient throughout the extent of very deep networks.
In the paper Norm-Preservation: Why Residual Networks Can Become Extremely Deep?, the authors also discuss another desirable effect of skip connections.
We show theoretically and empirically that each residual block in ResNets is increasingly norm-preserving, as the network becomes deeper
