I'm quoting from Understanding Machine Learning: From Theory to Algorithms (Shalev-Shwartz & Ben-David, Cambridge University Press, 2014):
Definition 2.1 (The Realizability Assumption). There exists $h^{\star} \in \mathcal{H}$ s.t. $L_{(\mathcal{D}, f)}(h^{\star}) = 0$. Note that this assumption implies that with probability 1 over random samples, $S$, where the instances of $S$ are sampled according to $\mathcal{D}$ and are labeled by $f$, we have $L_{S}(h^{\star})=0$.
My understanding of the second sentence in this definition is: because $h^{\star}$ satisfies $L_{(\mathcal{D}, f)}(h^{\star}) = 0$, every prediction made by $h^{\star}$ on an example $x$ drawn from the domain set $\mathcal{X}$ according to $\mathcal{D}$ is correct with probability 1 (otherwise the loss $L_{(\mathcal{D}, f)}(h^{\star})$ would not equal 0). In other words, $h^{\star}$ agrees with the true labeling function $f$ on essentially every point it can be asked about. Therefore, for any sample $S$ whose instances are drawn according to $\mathcal{D}$ and labeled by $f$, we have $L_{S}(h^{\star})=0$.
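Spelling my reasoning out (assuming I am reproducing the book's definitions of the true error and the empirical error correctly):

$$
L_{(\mathcal{D}, f)}(h^{\star}) \;=\; \underset{x \sim \mathcal{D}}{\mathbb{P}}\bigl[h^{\star}(x) \neq f(x)\bigr] \;=\; 0
\quad\Longrightarrow\quad
\text{with probability } 1,\ h^{\star}(x_i) = f(x_i) \text{ for every } i \in [m]
\quad\Longrightarrow\quad
L_{S}(h^{\star}) \;=\; \frac{\bigl|\{\, i \in [m] : h^{\star}(x_i) \neq y_i \,\}\bigr|}{m} \;=\; 0.
$$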
However, what I stumble upon is the author's further elaboration on this definition:
The realizability assumption implies that for every ERM hypothesis we have that $L_{S}(h_{S})=0$.
I don't quite get what the author means here, since every ERM hypothesis $h_{S}$ is found by some particular minimization algorithm, which in turn depends on a number of other factors, such as the choice of the loss function, the sample size, and the complexity of the algorithm, and thus may not always converge to $h^{\star}$. So why is $L_{S}(h_{S})=0$ guaranteed?
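To make my confusion concrete, here is a minimal toy sketch of how I currently picture an ERM rule over a finite hypothesis class (the uniform distribution, the labeling function, and the three-hypothesis class below are my own invented example, not anything from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Domain X = {0, ..., 9}; distribution D is uniform over X; true labels f(x) = x mod 2.
X = np.arange(10)
f = lambda x: x % 2

# A small finite hypothesis class; it contains h_star = f, so realizability holds.
hypotheses = {
    "h_star":  lambda x: x % 2,              # agrees with f everywhere
    "const_0": lambda x: np.zeros_like(x),   # predicts 0 everywhere
    "const_1": lambda x: np.ones_like(x),    # predicts 1 everywhere
}

# Draw a sample S of size m i.i.d. from D and label it by f.
m = 6
S_x = rng.choice(X, size=m)
S_y = f(S_x)

# Empirical risk L_S(h): fraction of sample points on which h disagrees with the labels.
def empirical_risk(h):
    return np.mean(h(S_x) != S_y)

# The ERM rule, as I understand it, returns any hypothesis minimizing L_S over the class.
risks = {name: empirical_risk(h) for name, h in hypotheses.items()}
h_S_name = min(risks, key=risks.get)
print(risks)
print("ERM picks:", h_S_name, "with empirical risk", risks[h_S_name])
```

In this toy case the exhaustive argmin obviously lands on a zero-risk hypothesis, but my question is about the general claim: is $L_{S}(h_{S})=0$ supposed to hold for any ERM output, regardless of which minimization procedure produced it?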