While reading the residual learning paper [1], I found a question that it leaves unanswered for me. Suppose I am stacking a deep neural network to map an input to an output. Let's say a function H(x) performs the required mapping. The function F(x) is the residual learned by the stacked layers, while the input x bypasses them through a shortcut and is added back just before the output, so that H(x) = F(x) + x.
Consider a case where H(x) maps every input x to 0, i.e. H(x) = 0 ∀x∈R. In this case F(x) = H(x) - x, i.e. F(x) = -x ∀x∈R.
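To make the setup concrete, here is a minimal sketch of such a residual block in PyTorch (the block structure, layer sizes, and the name `ResidualBlock` are my own illustration, not the exact architecture from [1]):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Toy residual block: the stacked layers learn F(x), and the
    identity shortcut adds x back, so the output is H(x) = F(x) + x."""
    def __init__(self, dim):
        super().__init__()
        # Two fully connected layers stand in for the stacked weight layers.
        self.layers = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        f_x = self.layers(x)   # F(x): the residual learned by the layers
        return f_x + x         # H(x) = F(x) + x via the identity shortcut

x = torch.randn(4, 16)
block = ResidualBlock(16)
h_x = block(x)  # for this block to realize H(x) = 0, the layers must learn F(x) = -x
```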
My question: wouldn't learning F(x) = -x be harder for a neural network than learning the simple mapping H(x) = 0? Please explain.