
Courtesy link: How would you intuitively but rigorously explain what the VC dimension is?

Trying to explain the idea of the VC dimension to some of my colleagues, I discovered quite an intuitive way of laying out the basic idea, without going through all the math and notation of my other answer.

Imagine the following game between two players $\alpha$ and $\beta$:

  1. First, player $\alpha$ plots $d=4$ points on a piece of paper. She may place the points however she likes.
  2. Next, player $\beta$ marks several of the drawn points.
  3. Finally, player $\alpha$ must draw a circle such that all the marked points are inside the circle and all the unmarked points are outside. (Points on the boundary are considered "inside".)

Player $\alpha$ wins if she can draw such a circle at step #3. Player $\beta$ wins if drawing such a circle is impossible.

If you analyze this game, you'll notice that player $\beta$ has a winning strategy: for any $d=4$ points on a plane, there is always a subset that player $\alpha$ cannot separate with a circle. (I won't go into the detailed proof of the strategy; it is straightforward but cumbersome, and I've sketched it in my other answer.) If we change the number of points to $d=3$, the game suddenly becomes winnable by player $\alpha$: for any three points that are not on the same line, every subset can be separated by a circle.
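To see this concretely, the circle game can be brute-forced; below is a minimal sketch, assuming SciPy is available. It uses the fact that the disc condition $(x-a)^2 + (y-b)^2 \leq r^2$ becomes linear in the unknowns $(a, b, c)$ after substituting $c = a^2 + b^2 - r^2$, so every marking chosen by $\beta$ reduces to a linear feasibility problem (the small `eps` enforcing "strictly outside" is a numerical tolerance of my choosing, not part of the game):

```python
import itertools
from scipy.optimize import linprog

def disc_exists(inside, outside, eps=1e-6):
    """Is there a disc containing every `inside` point and no `outside` point?

    (x, y) lies in the disc with center (a, b) and radius r iff
    x^2 + y^2 - 2ax - 2by + c <= 0 with c = a^2 + b^2 - r^2,
    which is linear in (a, b, c): a pure feasibility LP.
    """
    A_ub, b_ub = [], []
    for x, y in inside:   # value <= 0: inside or on the boundary
        A_ub.append([-2 * x, -2 * y, 1.0])
        b_ub.append(-(x * x + y * y))
    for x, y in outside:  # value >= eps: strictly outside
        A_ub.append([2 * x, 2 * y, -1.0])
        b_ub.append(x * x + y * y - eps)
    res = linprog([0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0  # status 0 means a feasible disc was found

def alpha_wins(points):
    """Can player alpha answer every marking beta might choose?"""
    return all(
        disc_exists(subset, [p for p in points if p not in subset])
        for k in range(len(points) + 1)
        for subset in itertools.combinations(points, k)
    )

print(alpha_wins([(0, 0), (2, 0), (0, 2)]))          # True: d = 3 is winnable
print(alpha_wins([(0, 0), (2, 0), (0, 2), (2, 2)]))  # False: the square fails
```

(The second call checks only one 4-point configuration; by the strategy mentioned above, every 4-point configuration has a losing marking.)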

The largest number $d$ at which the game is winnable by player $\alpha$ is called the VC dimension of our classification set. So, in the case of 2D discs (insides of circles), the VC dimension is 3. If one changes the rules to use axis-aligned rectangles instead of circles, then the maximum number of points winnable by $\alpha$ is 4; thus, the VC dimension of rectangular classification sets is 4.
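The rectangle game is even easier to brute-force, because a marking is realizable exactly when the minimal bounding box of the marked points traps no unmarked point. A minimal sketch, under the axis-aligned assumption:

```python
import itertools

def rect_alpha_wins(points):
    """Can axis-aligned rectangles realize every marking of `points`?

    A marked subset is realizable iff its minimal bounding rectangle
    contains no unmarked point; the empty marking is always realizable
    by a rectangle placed away from all points.
    """
    for k in range(1, len(points) + 1):
        for subset in itertools.combinations(points, k):
            xs, ys = [p[0] for p in subset], [p[1] for p in subset]
            for x, y in (p for p in points if p not in subset):
                if min(xs) <= x <= max(xs) and min(ys) <= y <= max(ys):
                    return False  # an unmarked point is trapped in the box
    return True

# Four points in a diamond are shattered by rectangles...
print(rect_alpha_wins([(0, 1), (1, 0), (0, -1), (-1, 0)]))          # True
# ...but a fifth point creates an unavoidable trap.
print(rect_alpha_wins([(0, 1), (1, 0), (0, -1), (-1, 0), (0, 0)]))  # False
```

The five-point failure is generic: mark the (at most four) points attaining the minimum and maximum $x$ and $y$ coordinates, and their bounding box necessarily contains every remaining point.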

Restoring the mathematical notation: we denote by $\mathcal{X} = \mathbb{R}^2$ our two-dimensional plane, $C$ is the subset of cardinality $|C| = d$ that player $\alpha$ selects, and $\mathcal{H}$ is the class (a "set of sets") of subsets of $\mathcal{X}$ that may be used as classification boundaries. Formally, the statement that the game above is winnable by $\alpha$ can be written as:

$$ \exists\, C \subset \mathcal{X},\ |C| = d :\ \forall\, A \subseteq C\ \exists\, B \in \mathcal{H} :\ A = B \cap C $$

The maximal $d$ at which this statement is true is the VC dimension. (I actually noticed the alternating quantifiers $\exists\,\forall\,\exists$ in the VC definition, a pattern typical of games, and worked backwards from the definition to construct the game above.)
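The inner $\forall\,\exists$ pair translates directly into code, while the outer $\exists\, C$ is player $\alpha$'s search over configurations. A minimal sketch for a finite hypothesis class, with hypotheses represented as predicates (the threshold class here is a hypothetical illustration, not tied to the circles above):

```python
import itertools

def shatters(C, hypotheses):
    """For every A ⊆ C, is there some B in `hypotheses` with A = B ∩ C?"""
    C = list(C)
    return all(
        any(set(A) == {x for x in C if h(x)} for h in hypotheses)
        for k in range(len(C) + 1)
        for A in itertools.combinations(C, k)
    )

# Hypothetical class on the real line: thresholds h_t(x) = [x <= t].
H = [lambda x, t=t: x <= t for t in (-0.5, 0.5, 1.5)]
print(shatters([0.0], H))       # True: both labelings of one point occur
print(shatters([0.0, 1.0], H))  # False: {1.0} alone is never realized
```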

This chapter discusses frameworks for analyzing learning algorithms, such as the Probably Approximately Correct (PAC) and Vapnik–Chervonenkis (VC) theories, with relevance to deep learning.

How can I look at the VC-Dimension as a measure of capacity?


You can look at the VC-Dimension as a measure of the capacity (richness, expressive power, flexibility, complexity) of the hypothesis space $H$ of a model: with it, we can extend PAC analysis to problems with an infinite hypothesis space.
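For concreteness: the finite-class PAC bound $m \geq \frac{1}{\epsilon}(\ln|H| + \ln\frac{1}{\delta})$ breaks down when $|H| = \infty$, and $VC(H)$ is what replaces the $\ln|H|$ term. One standard form of the resulting bound (Blumer et al. 1989, as presented in Mitchell's Machine Learning) states that any consistent learner is probably approximately correct once

$$ m \;\geq\; \frac{1}{\epsilon}\left( 4\log_2\frac{2}{\delta} + 8\,VC(H)\log_2\frac{13}{\epsilon} \right), $$

i.e. with probability at least $1-\delta$ the learned hypothesis has error at most $\epsilon$.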

Intuitively, you can think of $VC(H)$ as the cardinality of the largest set of points from your instance space $X$ that the algorithm can shatter. This ability is closely connected to the concept of the inductive bias embedded in $H$. That's the neat part: why?

Imagine having a bias-free learner, i.e. a learner that makes no prior assumptions about the target concept: obviously there would be no rationale behind its classification of new, unseen instances. By injecting an inductive bias, we effectively convert our learner into a deductive system, one that jumps to the logical conclusions entailed by the premises of our training examples together with the hypotheses assumed in $H$. Now the learner can define a policy for generalising beyond the observed data, in a non-procedural way, starting from what it has learned about the training examples.

Now: an unbiased $H$ is one that shatters the entire instance space $X$; but we want a biased $H$, so we almost certainly lose the power to shatter the whole instance space $X$.

But what about a subset $Q \subset X$? If the learner can shatter such a subset $Q$ of $X$, what does that tell us about $H$? Surely the expressiveness of $H$ is directly proportional to the size of the largest such $Q$. How do we measure this expressiveness?

With the VC-Dimension. That's it: $VC(H)$ is, for $H$ defined over $X$, the size of the largest finite subset $Q \subset X$ that can be shattered by $H$.
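With a finite toy instance space and hypothesis class, this definition can be evaluated by exhaustive search; a minimal sketch (the grid-interval class is a hypothetical illustration):

```python
import itertools

def vc_dim(X, H):
    """Size of the largest Q ⊆ X shattered by H (H given as predicates)."""
    best = 0
    for k in range(1, len(X) + 1):
        for Q in itertools.combinations(X, k):
            dichotomies = {tuple(h(x) for x in Q) for h in H}
            if len(dichotomies) == 2 ** k:  # every labeling of Q is realized
                best = k
    return best

# Hypothetical class: closed intervals [a, b] with endpoints on a grid.
grid = (-0.5, 0.5, 1.5, 2.5)
H = [lambda x, a=a, b=b: a <= x <= b for a in grid for b in grid if a <= b]
print(vc_dim([0, 1, 2], H))  # 2: no interval realizes {0, 2} without 1
```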

Two details:

  1. If $H$ can shatter arbitrarily large finite sets $Q$, then $VC(H) = \infty$.
  2. For any finite $H$, $VC(H) \leq \log_2(|H|)$: to shatter $n$ instances, $H$ needs at least $2^n$ distinct hypotheses (one per dichotomy), so $2^{VC(H)} \leq |H|$. The bound need not be tight: $|H|$ can far exceed $2^{VC(H)}$, which is why $\log_2(|H|) \geq VC(H)$ rather than equality.
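For instance, the toy interval class in the sketch above has $|H| = 10$ hypotheses but $VC(H) = 2$, and indeed $\log_2 10 \approx 3.32 \geq 2$: the inequality holds without being tight.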