
I have been learning about agnostic PAC learning and uniform convergence with finite hypothesis classes. If we use a neural network with $d$ parameters at $64$-bit precision, then

$$ \frac{r^2(128d+2\ln(\frac{2}{\delta}))}{\epsilon^2} $$

should be an upper bound on the number of examples needed for the sample error to be within $\epsilon$ of the generalization error simultaneously for every hypothesis, where $r$ is the "range" of the loss function.
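
For concreteness, here is the counting I have in mind (my own back-of-the-envelope, assuming a Hoeffding-plus-union-bound argument over all representable parameter settings, with the leading constants taken from the agnostic-PAC version of the bound; other texts state it with slightly different constants):

$$ |\mathcal{H}| \le \left(2^{64}\right)^{d} = 2^{64d} \quad\Longrightarrow\quad \ln|\mathcal{H}| \le 64d\ln 2 \le 64d, $$

$$ \frac{2r^2\left(\ln|\mathcal{H}| + \ln\frac{2}{\delta}\right)}{\epsilon^2} \;\le\; \frac{2r^2\left(64d + \ln\frac{2}{\delta}\right)}{\epsilon^2} \;=\; \frac{r^2\left(128d + 2\ln\frac{2}{\delta}\right)}{\epsilon^2}. $$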

The theory suggests overfitting shouldn't occur, but it obviously does, so what am I missing?

  • You say "finite hypothesis classes" and PAC learning? Are neural networks finite? I don't see how the theory suggests that over-fitting shouldn't occur. Can you be more specific? Aren't they talking about bounds and number of examples needed? – nbro Sep 11 '24 at 11:11
  • They are not finite, but in practice computers are limited to finite bit representations, so there are roughly $2^{64d}$ possible parameter settings (hypotheses) for a neural network with $d$ parameters – Joe Jameson Sep 13 '24 at 00:54

1 Answer


Indeed, PAC learning theory's uniform convergence result gives a worst-case sufficient sample size, i.e. an upper bound on the number of samples needed for the training error to be close to the generalization error for all hypotheses in a given hypothesis class simultaneously (with high probability). The bound depends on the size of the hypothesis class (or on a complexity measure of it, such as the VC dimension or Rademacher complexity), the desired error tolerance $\epsilon$, and the confidence level $1-\delta$.

However, for many practical ML models such as overparameterized neural networks, the (discretized) hypothesis class is astronomically large because of the huge number of parameters, so the sample size demanded by this finite-class bound is far beyond anything available in practice. When the training set is smaller than that threshold, the uniform convergence guarantee simply does not apply: nothing then rules out a hypothesis that fits the training set well yet generalizes poorly, which is exactly overfitting. There is no contradiction; the bound is sufficient but not necessary, and for overparameterized models it is extremely loose.
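
To get a sense of the scale, here is a minimal sketch that just plugs hypothetical numbers into the bound from the question (the parameter count, $\epsilon$, $\delta$ and $r=1$ are illustrative choices, not values from the question):

```python
import math

def finite_class_bound(d, eps, delta, r=1.0):
    """Evaluate the bound from the question: r^2 * (128*d + 2*ln(2/delta)) / eps^2."""
    return r**2 * (128 * d + 2 * math.log(2 / delta)) / eps**2

# A modest 1-million-parameter network, eps = 0.05, delta = 0.05, bounded loss with r = 1:
m = finite_class_bound(d=1_000_000, eps=0.05, delta=0.05)
print(f"samples required by the bound: {m:.2e}")  # ~5.1e+10
```

Even for this modest network the bound asks for tens of billions of labeled examples, orders of magnitude beyond any real training set, so the guarantee never actually kicks in at practical dataset sizes.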

cinch