I am currently studying "Understanding Machine Learning: From Theory to Algorithms" by Shai Shalev-Shwartz and Shai Ben-David. I want to understand how I can use the definitions and results of the theory they describe in practice.
Consider the problem of fitting a one-dimensional polynomial to data; that is, our goal is to learn a function $h : \mathbb{R} \to \mathbb{R}$, and as prior knowledge we take the hypothesis class $H$ of polynomials of degree at most 10. This class has $\mathrm{VCdim}(H) = 11$, so by the fundamental theorem of statistical learning it is agnostically PAC learnable with ERM. If I fix $\epsilon, \delta \in (0,1)$, then as soon as my sample size is at least \begin{equation} m \geq C_2\,\frac{11+\log(1/\delta)}{\epsilon^2}, \end{equation} I can be sure that with probability at least $1-\delta$, ERM outputs a hypothesis with \begin{equation} L_D(h_S)\leq \min_{h\in H}L_D(h)+\epsilon. \end{equation} Here $C_2$ is a constant and $h_S$ is the hypothesis returned by the ERM algorithm.

Now in practical terms, this alone does not tell me anything about the quality of my model: since we do not know the underlying distribution $D$, we cannot compute $L_D(h)$ for any $h\in H$. But I can compute $L_S(h)$ for every $h\in H$, in particular $L_S(h_S)$. My intuition is that we can then use the uniform-convergence property of our class $H$ (which holds because $\mathrm{VCdim}(H)$ is finite) to bound how much $L_D(h_S)$ and $L_S(h_S)$ differ. With probability at least $1-\delta$, \begin{equation} L_D(h_S)\leq \min_{h\in H}L_D(h)+\epsilon \leq \Bigl(\min_{h\in H}L_S(h)+\epsilon\Bigr)+\epsilon = L_S(h_S)+2\epsilon, \end{equation} where the second inequality uses uniform convergence (every $h\in H$ satisfies $|L_D(h)-L_S(h)|\leq\epsilon$) and the final equality uses that ERM minimizes the empirical risk, so $\min_{h\in H}L_S(h)=L_S(h_S)$.

Since I know the value of $\epsilon$, I can evaluate this expression. For example, with $\epsilon=0.01$ and $L_S(h_S)=0.01$, I can guarantee that the true error of my hypothesis $h_S$ is at most $0.03$ with probability at least $1-\delta$.
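To make this concrete for myself, here is a minimal Python sketch of how I would plug numbers into these two formulas. The constant `C2 = 4.0` and the choice `delta = 0.05` are placeholder assumptions of mine (the theorem only asserts that such a constant exists), not values from the book:

```python
import numpy as np

def required_sample_size(vc_dim, eps, delta, C2=4.0):
    """Sample size from the agnostic-PAC bound m >= C2 * (d + log(1/delta)) / eps^2.

    C2 is the unspecified constant from the fundamental theorem; 4.0 is only a
    placeholder assumption, not a value given in the book.
    """
    return int(np.ceil(C2 * (vc_dim + np.log(1.0 / delta)) / eps ** 2))

def true_risk_upper_bound(empirical_risk, eps):
    """Bound L_D(h_S) <= L_S(h_S) + 2*eps, valid with probability >= 1 - delta
    provided the sample is at least as large as required_sample_size(...)."""
    return empirical_risk + 2.0 * eps

eps, delta, vc_dim = 0.01, 0.05, 11          # degree-10 polynomials: VCdim = 11
m = required_sample_size(vc_dim, eps, delta)
print(f"required sample size: m >= {m}")

# Suppose ERM over H returned h_S with empirical risk L_S(h_S) = 0.01 on such a sample.
print(f"certified bound: L_D(h_S) <= {true_risk_upper_bound(0.01, eps):.2f}")
```

With these assumed values the script prints the required sample size and the bound $L_S(h_S)+2\epsilon = 0.03$ from the example above.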