Section 5.1 of *Understanding Machine Learning: From Theory to Algorithms* (2014) by Shai Shalev-Shwartz and Shai Ben-David states:
- With probability of at least $\frac{1}{7}$ over the choice of $S \sim D^m$ we have that $L_D(A(S)) \geq \frac{1}{8}$. This theorem states that for every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner. Indeed, a trivial successful learner in this case would be an ERM learner with the hypothesis class $H = \{f\}$, or more generally, ERM with respect to any finite hypothesis class that contains $f$ and whose size satisfies the equation $m \geq 8\log(\frac{7|H|}{6})$ (see Corollary 2.3).
Also according to the book:
COROLLARY 2.3 Let $H$ be a finite hypothesis class. Let $\delta \in (0, 1)$ and $\epsilon > 0$ and let $m$ be an integer that satisfies $m \geq \frac{\log(|H|/\delta)}{\epsilon}$. Then, for any labeling function, $f$, and for any distribution, $D$, for which the realizability assumption holds (that is, for some $h \in H$, $L_{(D,f)}(h) = 0$), with probability of at least $1 - \delta$ over the choice of an i.i.d. sample $S$ of size $m$, we have that for every ERM hypothesis, $h_S$, it holds that $L_{(D,f)}(h_S) \leq \epsilon$.
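For concreteness, here is a minimal Python sketch of that sample-size bound (the function name and the example values are my own, and $\log$ is taken to be the natural logarithm):

```python
import math

def corollary_2_3_bound(h_size: int, delta: float, epsilon: float) -> int:
    """Smallest integer m with m >= log(|H| / delta) / epsilon, per Corollary 2.3."""
    return math.ceil(math.log(h_size / delta) / epsilon)

# Illustrative values (not from the book): |H| = 1000, delta = 0.1, epsilon = 0.1
print(corollary_2_3_bound(1000, 0.1, 0.1))  # ceil(log(10000) / 0.1) = 93
```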
I am trying to understand how $m \geq 8\log(\frac{7|H|}{6})$ was derived.
$\delta$ is the probability of failure, i.e., the probability that the generalization error $L_D(A(S))$ is at least $\epsilon$. Therefore $\delta = \frac{1}{7}$ and $\epsilon = \frac{1}{8}$. Substituting these into Corollary 2.3 gives:
$m \geq \frac{\log\left(\frac{|H|}{1/7}\right)}{1/8} = 8\log(7|H|)$.
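To sanity-check the arithmetic, here is a small Python snippet (again treating $\log$ as the natural logarithm; the variable names are just illustrative) comparing the book's expression with the one I derived; they differ by the additive constant $8\log 6 \approx 14.33$, independent of $|H|$:

```python
import math

for h_size in (2, 10, 100, 1000):
    book_bound = 8 * math.log(7 * h_size / 6)  # book: 8 log(7|H|/6)
    my_bound = 8 * math.log(7 * h_size)        # substitution with delta = 1/7
    print(h_size, round(book_bound, 2), round(my_bound, 2),
          round(my_bound - book_bound, 2))     # gap is always 8*log(6) ≈ 14.33
```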
Where is the 6 coming from in this case?
I understand that if $\delta = \frac{6}{7}$ we would get $8\log(\frac{7|H|}{6})$, but $\frac{6}{7} = 1 - \frac{1}{7}$ is the probability of success, not the probability of failure.