Suppose we have a perceptron with no bias and the identity activation $f(x) = x$, together with matrices $X, Y, W$: the training inputs are the columns of $X$, $Y$ is the target matrix (its columns are ordered to match the corresponding columns of $X$), and $W$ is the perceptron's weight matrix. $X_i$ denotes the $i$th column of $X$, and similarly for the other matrices. (So if the inputs are $n$-dimensional, the targets are $k$-dimensional, and there are $m$ training pairs, then $X \in \mathbb{R}^{n \times m}$, $Y \in \mathbb{R}^{k \times m}$, and $W \in \mathbb{R}^{k \times n}$.) We want to minimize:
$$ E(W) = \sum\limits_{i} ||WX_i - Y_i||^{2} $$
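Note that, since the columns of $WX - Y$ are exactly the vectors $WX_i - Y_i$, this sum of squared column norms is (if I am reading it right) just the squared Frobenius norm:
$$ E(W) = \|WX - Y\|_F^{2}. $$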
In a class I saw someone claim we can find the optimum $W$ by setting $\frac{d}{dW}E(W) = 0$. Then, using the identity
$$ \|V\|^{2} = V^{T}V \tag{1} $$
for a column vector $V$, he continued:
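My own partial attempt: expanding each term with (1) gives
$$ E(W) = \sum_i (WX_i - Y_i)^{T}(WX_i - Y_i) = \sum_i \left( X_i^{T} W^{T} W X_i - 2\, Y_i^{T} W X_i + Y_i^{T} Y_i \right), $$
and I believe the matrix-calculus rules $\frac{\partial}{\partial W} \left( X_i^{T} W^{T} W X_i \right) = 2 W X_i X_i^{T}$ and $\frac{\partial}{\partial W} \left( Y_i^{T} W X_i \right) = Y_i X_i^{T}$ would then give
$$ \frac{d}{dW}E(W) = 2 \sum_i (W X_i - Y_i) X_i^{T} = 2\,(WX - Y)X^{T}, $$
but I am not sure this is the justification he used.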
\begin{align*} \frac{d}{dW}E(W) = 0 & \Rightarrow (WX - Y)X^{T} = 0 \\ & \Rightarrow WXX^{T} = YX^{T} \\ & \Rightarrow W = YX^{T}(XX^{T})^{-1} \end{align*}
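To sanity-check the formula I tried a minimal NumPy sketch (the dimensions `n, k, m` and the random data are just placeholders I picked), comparing the closed form against `numpy.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)

n, k, m = 4, 3, 50                 # input dim, target dim, number of samples (my choice)
X = rng.standard_normal((n, m))    # training inputs as columns
Y = rng.standard_normal((k, m))    # matching targets as columns

def E(W):
    """Sum of squared column errors, i.e. ||W X - Y||_F^2."""
    return np.sum((W @ X - Y) ** 2)

# Claimed closed form: W = Y X^T (X X^T)^{-1}
W_closed = Y @ X.T @ np.linalg.inv(X @ X.T)

# Reference: min_W ||W X - Y||_F^2 is the least-squares problem X^T W^T ~ Y^T
W_lstsq = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T

print(np.allclose(W_closed, W_lstsq))   # expect True when X X^T is invertible
print(E(W_closed) <= E(W_closed + 1e-3 * rng.standard_normal((k, n))))  # expect True
```

On this random example both checks pass, but of course that does not tell me whether the derivation itself is valid, or what happens when $XX^{T}$ is singular.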
He claimed that the above formula gives the optimum $W$. But I do not know whether that is true, or how he calculated $\frac{d}{dW}E(W)$. If it is not true, is there a correct closed-form formula for $W$? Also, can you provide references for further study?