Suppose we have a perceptron with no bias and the identity activation $f(x) = x$, together with matrices $X, Y, W$: the training inputs are the columns of $X$, $Y$ is the target matrix (its columns are ordered to match the corresponding columns of $X$), and $W$ is the perceptron's weight matrix. $X_i$ denotes the $i$th column of $X$, and similarly for the other matrices. (So if the inputs are $n$-dimensional, the targets are $k$-dimensional, and there are $m$ training pairs, then $X \in \mathbb{R}^{n \times m}$, $Y \in \mathbb{R}^{k \times m}$, and $W \in \mathbb{R}^{k \times n}$.) We want to minimize:
$$ E(W) = \sum\limits_{i} ||WX_i - Y_i||^{2} $$
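Note that, since the columns of $WX - Y$ are exactly the vectors $WX_i - Y_i$, this sum of squared column norms is (if I am reading it right) just the squared Frobenius norm:
$$ E(W) = \|WX - Y\|_F^{2}. $$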
In a class I saw someone claim we can find the optimum $W$ by setting $\frac{d}{dW}E(W) = 0$. Then, using the identity
$$ \|V\|^{2} = V^{T}V \tag{1} $$
for a column vector $V$, he continued:
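My own partial attempt: expanding each term with (1) gives
$$ E(W) = \sum_i (WX_i - Y_i)^{T}(WX_i - Y_i) = \sum_i \left( X_i^{T} W^{T} W X_i - 2\, Y_i^{T} W X_i + Y_i^{T} Y_i \right), $$
and I believe the matrix-calculus rules $\frac{\partial}{\partial W} \left( X_i^{T} W^{T} W X_i \right) = 2 W X_i X_i^{T}$ and $\frac{\partial}{\partial W} \left( Y_i^{T} W X_i \right) = Y_i X_i^{T}$ would then give
$$ \frac{d}{dW}E(W) = 2 \sum_i (W X_i - Y_i) X_i^{T} = 2\,(WX - Y)X^{T}, $$
but I am not sure this is the justification he used.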
\begin{align*} \frac{d}{dW}E(W) = 0 & \Rightarrow (WX - Y)X^{T} = 0 \\ & \Rightarrow WXX^{T} = YX^{T} \\ & \Rightarrow W = YX^{T}(XX^{T})^{-1} \end{align*}
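To sanity-check the formula I tried a minimal NumPy sketch (the dimensions `n, k, m` and the random data are just placeholders I picked), comparing the closed form against `numpy.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)

n, k, m = 4, 3, 50                 # input dim, target dim, number of samples (my choice)
X = rng.standard_normal((n, m))    # training inputs as columns
Y = rng.standard_normal((k, m))    # matching targets as columns

def E(W):
    """Sum of squared column errors, i.e. ||W X - Y||_F^2."""
    return np.sum((W @ X - Y) ** 2)

# Claimed closed form: W = Y X^T (X X^T)^{-1}
W_closed = Y @ X.T @ np.linalg.inv(X @ X.T)

# Reference: min_W ||W X - Y||_F^2 is the least-squares problem X^T W^T ~ Y^T
W_lstsq = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T

print(np.allclose(W_closed, W_lstsq))   # expect True when X X^T is invertible
print(E(W_closed) <= E(W_closed + 1e-3 * rng.standard_normal((k, n))))  # expect True
```

On this random example both checks pass, but of course that does not tell me whether the derivation itself is valid, or what happens when $XX^{T}$ is singular.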
He claimed that the above formula gives the optimum $W$. But I do not know whether that is true, or how he calculated $\frac{d}{dW}E(W)$. If it is not true, is there a correct closed-form formula for $W$? Also, can you provide references for further study?