
On p. 36 of "Machine Learning: The Basics" (Alexander Jung, Springer), the author writes:

The fundamental theorem of algebra tells us that any set of m data points with different features can be perfectly fit by a polynomial of degree n as long as n ≥ m.

Here $n$ is the length of the weight vector $\mathbf{w} \in \mathbb{R}^{n}$. To be specific, the hypothesis space is:

$\mathcal{H}^{(n)}_{\text{poly}} = \{h:\mathbb{R}^{n} \rightarrow \mathbb{R} : h(\mathbf{x})=\mathbf{w}^{T}\mathbf{x} \ \text{with some vector} \ \mathbf{w}\in\mathbb{R}^{n}\}.$

But I can't relate the fundamental theorem of algebra to the existence of the hypothesis (or the weight vector). Can someone help explain?
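For concreteness, here is the picture I currently have in mind (a minimal sketch of my own, not from the book; the feature map `phi` and the toy numbers are just for illustration of how a polynomial becomes a linear hypothesis $h(\mathbf{x})=\mathbf{w}^{T}\mathbf{x}$):

```python
import numpy as np

def phi(x, n):
    """Map a scalar x to the polynomial feature vector (1, x, x^2, ..., x^(n-1))."""
    return np.array([x**k for k in range(n)])

# A weight vector w in R^n defines the polynomial w_0 + w_1*x + ... + w_{n-1}*x^(n-1).
w = np.array([1.0, -2.0, 0.5])           # n = 3, i.e. a quadratic
h = lambda x: w @ phi(x, len(w))         # the linear hypothesis w^T x applied to the features

print(h(2.0))                            # 1 - 2*2 + 0.5*4 = -1.0
```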

Tran Khanh

2 Answers


First, there's a mistake or typo in the quoted statement. The requirement should be $n \geq m-1,$ not $n \geq m.$ For instance, you can fit two points $(m=2)$ with a line $(n=1).$

Second, it's most straightforward to see this with simple linear algebra, not the fundamental theorem of algebra. I'll explain below.

In general, if you have an $n$th-degree polynomial $p(x) = a_0 + a_1 x + \ldots + a_n x^n,$ and you're fitting it to the set of $m$ data points $\{ (x_1,y_1), \ldots, (x_m,y_m) \},$ then you're trying to solve the following system of linear equations:

\begin{align*} y_1 &= p(x_1) = a_0 + a_1 x_1 + \ldots + a_n x_1^n \\ &\vdots \\ y_m &= p(x_m) = a_0 + a_1 x_m + \ldots + a_n x_m^n \\ \end{align*}

Let's write this using vectors:

\begin{align*} \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix} = \begin{bmatrix} a_0 \\ \vdots \\ a_0 \end{bmatrix} + \begin{bmatrix} a_1x_1 \\ \vdots \\ a_1x_m \end{bmatrix} + \cdots + \begin{bmatrix} a_nx_1^n \\ \vdots \\ a_nx_m^n \end{bmatrix} \end{align*}

And simplify a bit:

\begin{align*} \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix} = a_0 \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} + a_1 \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix} + \cdots + a_n \begin{bmatrix} x_1^n \\ \vdots \\ x_m^n \end{bmatrix} \end{align*}

In other words, a solution exists if the $y$-vector is in the span of the $x$-vectors:

\begin{align*} \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix} \in \textrm{span} \left \{ \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}, \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}, \cdots, \begin{bmatrix} x_1^n \\ \vdots \\ x_m^n \end{bmatrix} \right \} \end{align*}

Since the $y$-values are arbitrary, this is equivalent to saying that the $x$-vectors span all possible $y$-vectors in $\mathbb{R}^m{:}$

\begin{align*} \textrm{span} \left \{ \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}, \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}, \cdots, \begin{bmatrix} x_1^n \\ \vdots \\ x_m^n \end{bmatrix} \right \} = \mathbb{R}^m \end{align*}

A set of vectors spans $\mathbb{R}^m$ if and only if it contains $m$ linearly independent vectors. So, let's count off $m$ vectors:

  • The $1$st vector is $\begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}$ which you can think of as $\begin{bmatrix} x_1^0 \\ \vdots \\ x_m^0 \end{bmatrix},$
  • the components of the $2$nd vector $\begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix},$ i.e. $\begin{bmatrix} x_1^1 \\ \vdots \\ x_m^1 \end{bmatrix},$ have exponents $1,$
  • the components of the $3$rd vector (not shown above, but it would be $\begin{bmatrix} x_1^2 \\ \vdots \\ x_m^2 \end{bmatrix}$) have exponents $2,$
  • ...
  • the components of the $m$th vector have exponents $m-1.$

These $m$ vectors are the columns of a Vandermonde matrix, and since the data points have distinct $x$-values its determinant $\prod_{1\le i<j\le m}(x_j-x_i)$ is nonzero, so they are indeed linearly independent. To get all $m$ of them we need exponents up to $m-1,$ i.e. we need $n$ to be at least $m-1.$
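To check this numerically, here is a small sketch (my own illustration; the random points are made up): with $m$ distinct $x$-values and a degree-$(m-1)$ polynomial, the Vandermonde system above is square and invertible, so the fit is exact.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                     # number of data points
x = rng.uniform(-1, 1, size=m)            # distinct x-values (with probability 1)
y = rng.uniform(-1, 1, size=m)            # arbitrary targets

# Columns are x^0, x^1, ..., x^(m-1): exactly the m vectors counted off above.
V = np.vander(x, N=m, increasing=True)

a = np.linalg.solve(V, y)                 # coefficients a_0, ..., a_{m-1}
print(np.allclose(V @ a, y))              # True: the degree-(m-1) polynomial fits all m points exactly
```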


Justing gave an amazing answer; however, I would like to give a more "empirical" way of looking at it.

Say you have a 2D plane and you randomly place 2 points; it's easy to see that you can pass a line exactly through both points.

Say you have 3D space and you randomly place 3 points; it's easy to see that you can pass a plane exactly through all 3 points.

Then you can extend this to any dimension (losing the visual representation); a small numerical sketch of the 3-point case is given below.

(Note: this kind of "generalization to N dimensions" does not always hold, but in this case it works and is correct.)
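Here is a quick numerical version of the 3-points/plane picture (my own sketch, with made-up points): a plane $z = a + bx + cy$ has 3 coefficients, so 3 points in general position pin it down exactly.

```python
import numpy as np

# Three points (x, y, z) in general position (made-up example values).
pts = np.array([[0.0, 0.0, 1.0],
                [1.0, 0.0, 2.0],
                [0.0, 1.0, 3.0]])

# Solve for the plane z = a + b*x + c*y passing through all three points.
A = np.column_stack([np.ones(3), pts[:, 0], pts[:, 1]])
a, b, c = np.linalg.solve(A, pts[:, 2])

print(a, b, c)                                  # 1.0 1.0 2.0
print(np.allclose(A @ [a, b, c], pts[:, 2]))    # True: the plane passes through all 3 points
```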

Alberto