
I often see blog posts or questions on here starting with the premise that ResNets solve the vanishing gradient problem.

The original 2015 paper contains the following passage in section 4.1:

We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with BN, which ensures forward propagated signals to have non-zero variances. We also verify that the backward propagated gradients exhibit healthy norms with BN. So neither forward nor backward signals vanish. In fact, the 34-layer plain net is still able to achieve competitive accuracy, suggesting that the solver works to some extent.

So what has happened since then? It seems to me that either it has become a misconception that ResNets solve the vanishing gradient problem (it does feel like a sensible explanation that one would readily accept and continue to propagate), or some later paper has shown that this is indeed the case.

I'm starting from the premise that it's "easier" to learn the residual mapping for a convolutional block than it is to learn the whole mapping. So my question is: why is it "easier"? And why does the "plain network" do such a good job yet still struggle to close the gap to the performance of the ResNet? Presumably, if the plain network has already learned reasonably good mappings, then all it has left to learn to close the gap is the "residual", but it just isn't able to.

Alexander Soare

1 Answer


They explain in the paper why they introduced residual blocks. They argue that it's easier to learn the residual function $F(x) = H(x) - x$ and then add it to the original representation $x$ to get the hidden representation $H(x) = F(x) + x$ than it is to learn the hidden representation $H(x)$ directly from the original representation. That is the main reason, and the empirical results suggest that they might be right. Better gradient propagation might be an additional bonus, but that's not why they originally introduced the idea.
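
To make the $H(x) = F(x) + x$ decomposition concrete, here is a minimal sketch of a residual block in PyTorch. It is illustrative only, not the exact block from the paper: the two-convolution layout, channel counts, and placement of the final ReLU are assumptions for exposition.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Illustrative residual block: output is H(x) = relu(F(x) + x)."""

    def __init__(self, channels: int):
        super().__init__()
        # The residual branch F(x): two 3x3 convolutions with batch norm.
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The block only has to learn the residual F(x); the input x is
        # passed through unchanged via the skip connection.
        return self.relu(self.residual(x) + x)
```

A plain block would instead return `self.relu(self.residual(x))` and would have to learn the entire mapping $H(x)$ inside its weights.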

"Normal" networks work too but, at some point, they become too deep and start working worse than shallower versions (they empirically showed that in the paper). Again, they argue that the reason for that might be that at deeper layers hidden representations become approximately similar $H_n \approx H_{n+1}$ because representation is already well learned and you only need some slight adjustments. That would mean that transformation for deeper layers is similar to identity transformation and that ordinary layers might have trouble learning that, while for residual blocks it would be easy to learn slight modification and add that to the already existing representation from the previous layer.

Brale