
Section 5.1 of Understanding Machine Learning: From Theory to Algorithms (2014) by Shai Shalev-Shwartz and Shai Ben-David states:

  1. With probability of at least $\frac{1}{7}$ over the choice of $S \sim D^m$ we have that $L_D(A(S)) \ge \frac{1}{8}$. This theorem states that for every learner, there exists a task on which it fails, even though that task can be successfully learned by another learner. Indeed, a trivial successful learner in this case would be an ERM learner with the hypothesis class $H = \{f\}$, or more generally, ERM with respect to any finite hypothesis class that contains $f$ and whose size satisfies the equation $m \ge 8\log\left(\frac{7|H|}{6}\right)$ (see Corollary 2.3).

Also according to the book:

COROLLARY 2.3 Let $H$ be a finite hypothesis class. Let $\delta \in (0, 1)$ and $\epsilon > 0$ and let $m$ be an integer that satisfies $m \ge \frac{\log(|H|/\delta)}{\epsilon}$. Then, for any labeling function, $f$, and for any distribution, $D$, for which the realizability assumption holds (that is, for some $h \in H$, $L_{(D,f)}(h) = 0$), with probability of at least $1 - \delta$ over the choice of an i.i.d. sample $S$ of size $m$, we have that for every ERM hypothesis, $h_S$, it holds that $L_{(D,f)}(h_S) \le \epsilon$.

I am trying to understand how $m \ge 8\log\left(\frac{7|H|}{6}\right)$ was derived.

$\delta$ is the probability of failure, i.e., the probability that the generalization error, $L_D(A(S))$, is at least $\epsilon$. Therefore $\delta = \frac{1}{7}$ and $\epsilon = \frac{1}{8}$. Substituting these into Corollary 2.3 gives:

$m \ge \frac{\log\left(\frac{|H|}{1/7}\right)}{1/8} = 8\log(7|H|)$.

Where is the 6 coming from in this case?

I understand that if $\delta = \frac{6}{7}$ we would get $8\log\left(\frac{7|H|}{6}\right)$, but $\frac{6}{7}$ is $1 - \delta$, which is the probability of success, not the probability of failure.


1 Answer


You actually need to substitute $\delta = \frac{6}{7}$ to account for the extra "$6$". Corollary 2.3 is being used here to find the minimum training sample size $m$ that allows another ERM learner $E$, with a finite hypothesis class $H$ containing $f$, to successfully learn the task on which an arbitrary classification learner $A$ fails; the failure of $A$, when the samples are generated i.i.d. from a specific joint distribution $D$, is what the No-Free-Lunch Theorem 5.1 proves. "Success" here means that with probability at least $1 - \delta$ the ERM learner $E$'s true risk is below $\epsilon$.
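
Carrying that substitution through Corollary 2.3 with $\epsilon = \frac{1}{8}$ and $\delta = \frac{6}{7}$, just spelling out the arithmetic:

$$m \ge \frac{\log\left(\frac{|H|}{\delta}\right)}{\epsilon} = \frac{\log\left(\frac{|H|}{6/7}\right)}{1/8} = 8\log\left(\frac{7|H|}{6}\right),$$

which is exactly the bound quoted from Section 5.1.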

We can arrive at $\delta = \frac{6}{7}$ by following the hint of Exercise 1 of the same chapter of the book. First we need Lemma B.1, which says that for every $a \in (0, 1)$ and every random variable $Z$ taking values in $[0,1]$ with $\mathbb{E}[Z] = \mu$, we have $\mathbb{P}[Z > 1-a] \ge \frac{\mu-(1-a)}{a}$ and $\mathbb{P}[Z > a] \ge \mu - a$.
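
For concreteness, the two instantiations of the lemma used in the next step (for a random variable $Z \in [0,1]$ with $\mathbb{E}[Z] = \mu$) are:

$$a = \tfrac{7}{8}: \quad \mathbb{P}\left[Z > \tfrac{1}{8}\right] \ge \frac{\mu - \frac{1}{8}}{7/8}, \qquad a = \tfrac{1}{8}: \quad \mathbb{P}\left[Z > \tfrac{1}{8}\right] \ge \mu - \tfrac{1}{8}.$$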

Then, from the critical result (5.2), $\mathbb{E}[L_D(A(S))] \ge \frac{1}{4}$, we can apply Lemma B.1 with $Z = L_D(A(S))$ and $\mu \ge \frac{1}{4}$, once with $a = \frac{7}{8}$ and once with $a = \frac{1}{8}$. This easily yields $\mathbb{P}[Z > \frac{1}{8}] \ge \frac{1}{7}$ and $\mathbb{P}[Z > \frac{1}{8}] \ge \frac{1}{8}$, respectively; merging them (keeping the stronger bound) gives a single result for the failed arbitrary learner $A$: $\mathbb{P}[L_D(A(S)) > \frac{1}{8}] \ge \frac{1}{7}$. Conversely, the successful ERM learner $E$'s true risk must satisfy $\mathbb{P}[L_D(E(S)) \le \frac{1}{8}] \ge \frac{1}{7}$, since for each sample $S$ of size $m$ drawn i.i.d. from $D$ on which $A$ fails ($L_D(A(S)) > \frac{1}{8}$), $E$ succeeds ($L_D(E(S)) \le \frac{1}{8}$). So $\delta = \frac{6}{7}$, in the sense that with probability at least $\frac{1}{7}$ the successful ERM learner $E$'s true risk is at most $\frac{1}{8}$.
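
Spelling out those two applications of Lemma B.1 with $\mu \ge \frac{1}{4}$:

$$\mathbb{P}\left[Z > \tfrac{1}{8}\right] \ge \frac{\frac{1}{4} - \frac{1}{8}}{7/8} = \frac{1}{7} \quad \left(a = \tfrac{7}{8}\right), \qquad \mathbb{P}\left[Z > \tfrac{1}{8}\right] \ge \frac{1}{4} - \frac{1}{8} = \frac{1}{8} \quad \left(a = \tfrac{1}{8}\right),$$

and the stronger of the two, $\frac{1}{7}$, is the bound that appears in Theorem 5.1.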
