
While reading the residual learning paper [1], I found a question that it leaves largely unanswered. Suppose I am stacking a deep neural network to map an input to an output, and let's say a function H(x) is the required mapping. The function F(x) is the residual to be learnt by the stacked layers, while the input x bypasses them through a skip connection and is added back just before the output.

Consider a case where H(x) maps every input x to 0, i.e. H(x) = 0 ∀ x ∈ R. For this case F(x) = H(x) - x, i.e. F(x) = -x ∀ x ∈ R.

My question is: wouldn't learning F(x) = -x be harder for a neural network than the simple mapping H(x) = 0? Please explain.
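To make the setup concrete, here is a minimal sketch of a residual block in PyTorch. This is my own illustration, not code from the paper; the class name `ResidualBlock` and the layer sizes are arbitrary. The block outputs F(x) + x, so producing H(x) = 0 would require the residual branch to learn F(x) = -x, while producing the identity H(x) = x only requires F(x) = 0.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Illustrative residual block: output = F(x) + x."""
    def __init__(self, dim: int):
        super().__init__()
        # F(x): the residual branch learnt by the stacked layers
        self.residual_branch = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # H(x) = F(x) + x, so realizing H(x) = 0 requires F(x) = -x,
        # whereas the identity H(x) = x only requires F(x) = 0.
        return self.residual_branch(x) + x

x = torch.randn(4, 8)
block = ResidualBlock(8)
print(block(x).shape)  # torch.Size([4, 8])
```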

ishaan

1 Answer


First of all, you can always construct an example that breaks a method's assumptions.

"Say I take an MLP and generate data according to that MLP, then that MLP is the optimal solution, and anything you add on top of it worsens it, including naive regularization"

Turning to your toy problem:

  1. There is a reason the title says "for Image Recognition".
  2. Your counterexample works only for 1D mappings.
  3. The problem that residual networks help with is OPTIMIZATION; in your example, no optimization has been done (see the sketch after this list).
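To illustrate the optimization point, here is a hedged toy comparison, not an experiment from the paper: a deep plain MLP versus a residual stack of comparable depth, both trained with SGD to fit the identity mapping. All names and settings (`make_plain`, `ResBlock`, the depth and learning rate) are illustrative choices; exact numbers will vary on a toy like this, but the residual stack typically reaches a lower loss because the skip connections give the gradient a direct path past the nonlinearities.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM, DEPTH, STEPS = 16, 20, 200

def make_plain(depth: int) -> nn.Module:
    # Plain stack: Linear/ReLU layers with a linear output layer
    layers = []
    for _ in range(depth - 1):
        layers += [nn.Linear(DIM, DIM), nn.ReLU()]
    layers.append(nn.Linear(DIM, DIM))
    return nn.Sequential(*layers)

class ResBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(), nn.Linear(DIM, DIM))
    def forward(self, x):
        return self.f(x) + x  # skip connection

def make_residual(depth: int) -> nn.Module:
    # Each block holds two linear layers, so depth // 2 blocks match the plain depth
    return nn.Sequential(*[ResBlock() for _ in range(depth // 2)])

def train(model: nn.Module) -> float:
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    x = torch.randn(256, DIM)
    y = x.clone()  # target: the identity mapping, hard for a deep plain net
    for _ in range(STEPS):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

print("plain   :", train(make_plain(DEPTH)))
print("residual:", train(make_residual(DEPTH)))
```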

Is X universally, unilaterally, unanimously better?
Probably not. Let's all welcome the "no free lunch theorem".

Is X probably going to help?
We can talk about it.

Are residual connections going to help you?
If you have a deep network, most likely.

Alberto