
I was applying PCA to my whole dataset (and only afterwards splitting it into training, validation, and test sets). However, after a bit of research, I found out that this is the wrong way to do it.

I have a few questions:

  1. Are there some articles/references that explain why this is the wrong way?

  2. How can I transform the validation/test set?

Steps to do PCA (from https://www.sciencedirect.com/science/article/pii/S0022460X0093390X):

  1. zero mean

$$\mu = \frac{1}{M}\sum_{i=1}^{M} x_{i}$$

where the $x_{i}$ are the samples of my training set

  2. variance (for scaling)

$$S^{2} = \frac{1}{M}\sum_{i=1}^{M} (x_{i}-\mu)^{T}(x_{i}-\mu)$$

  3. use (1) and (2) to transform my original training dataset

$$x_{new} = \frac{1}{\sqrt{M}} \frac{(x_{i} - \mu)}{S}$$

  4. calculate the covariance matrix (actually the correlation matrix)

$$C= x_{new}^T x_{new}$$

  5. take the $k$ eigenvectors ($\phi$) of the covariance matrix and define the new space for my reduced-dimension training set (where $k$ is the number of principal components that I choose according to the explained variance)

$$x_{newdim} = x_{new}\phi$$

OK, so now I have my reduced-dimension training dataset after PCA (up to here it's correct, according to other papers that I have read).
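To check my understanding, here is a rough NumPy sketch of steps 1–5 (the variable names are my own, and I keep the $1/\sqrt{M}$ factor from step 3 so that step 4 gives the correlation matrix directly):

```python
import numpy as np

def fit_pca(X_train, k):
    """PCA fitted on the training set only; X_train has shape (M, d)."""
    M = X_train.shape[0]
    mu = X_train.mean(axis=0)                  # step 1: mean of the training set
    S = X_train.std(axis=0)                    # step 2: per-feature standard deviation
    X_new = (X_train - mu) / (np.sqrt(M) * S)  # step 3: standardize (with the 1/sqrt(M) factor)
    C = X_new.T @ X_new                        # step 4: correlation matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
    phi = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # step 5: top-k eigenvectors
    return mu, S, phi, X_new @ phi             # projected training set

# usage on a toy training set
X_train = np.random.default_rng(0).normal(size=(100, 6))
mu, S, phi, x_newdim = fit_pca(X_train, k=2)
```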

The question is: what do I have to do now for my validation/test set? Just apply the equation below?

$$y_{new dim} = y\phi $$

where y is, for example, my original validation dataset?

Can someone explain the right thing to do?


1 Answer


For the first point, I'm very sorry that I cannot give you any literature on this, but I might be able to explain why you shouldn't run PCA on the two datasets independently.

Principal component analysis is simply a transformation of your data into another (lower-dimensional) coordinate system. The axes of your new coordinate system are defined by the principal components (i.e. the eigenvectors) of your covariance matrix.

Since you will train your machine learning algorithm in the domain generated by the PCA, your test data must be in exactly the same domain. So, as you said, you use exactly the same transformation for the test data as for the training data, i.e. $y_{newdim} = y \phi$. Of course, if you applied standardization to your training data, you have to apply the same standardization to your test data. So you need to store the mean $\mu_x$ and the standard deviation $S_x$ and also standardize your test data $y$:

$$y_{standardized} = \frac{y_{i} - \mu_{x}}{S_{x}}$$

Note here that you have an error in your standardization formula (the $\frac{1}{\sqrt{M}}$ factor is not needed).
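As a sketch of that workflow (using scikit-learn for convenience; the data, the split, and the number of components below are just placeholders):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# toy data standing in for your dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
X_train, X_test = X[:800], X[800:]

# fit mu_x, S_x and phi on the training set only
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=10).fit(scaler.transform(X_train))

x_newdim = pca.transform(scaler.transform(X_train))
# reuse exactly the same mu_x, S_x and phi for the validation/test set
y_newdim = pca.transform(scaler.transform(X_test))
```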

The point is that the principal components of your test data, $\phi_{y}$, would not match the principal components of your training data, $\phi_{x}$. Thus the transformations from the original space into PCA space, $\Phi_x(u)$ and $\Phi_y(u)$, would diverge, and similar data points in the original space might end up far apart in the PCA representation, and vice versa. This is why you build the mapping $x_{newdim} = \Phi_x(x) = x\phi$ on the training data and apply it to the test data as well.
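You can check this yourself with a small experiment (a toy check, not part of any standard reference): fit the leading component separately on two halves of the same data and compare the directions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5)) @ rng.normal(size=(5, 5))  # correlated features
X_a, X_b = X[:30], X[30:]

def leading_component(X):
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, eigvecs = np.linalg.eigh(Z.T @ Z)
    return eigvecs[:, -1]          # eigenvector of the largest eigenvalue

phi_a, phi_b = leading_component(X_a), leading_component(X_b)
# |cos(angle)| between the two fitted directions: below 1 in general,
# i.e. the two independently fitted PCA coordinate systems do not coincide.
print(np.abs(phi_a @ phi_b))
```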

I hope this makes it clear.

Best