
BACKGROUND: There is a lot of information online about the problem of multicollinearity as it relates to machine learning and how to identify correlated features. However, I am still unclear on which variables to eliminate once a correlated subset of feature variables has been identified.

QUESTION: Once a correlated set of 2 or more feature variables has been identified by some method, how does one decide which one to retain?

nbro
Snehal Patel

2 Answers


Thank you for asking the question. In statistics, the problem of multicollinearity is typically addressed using partial correlation. The correlation matrix is also analyzed to understand the impact of the independent features on the target variable (the output). It is good practice to eliminate features that have little or no correlation with the target.

But if you are worried about multicollinearity, look at the correlation of each feature with the target variable and drop the feature that is less correlated with the target. Say A, B, C, D, E are five variables, where E is the target and the others are features determining E. Suppose A and B have a correlation of 0.7, A and E have 0.8, and B and E have 0.7. Then it makes sense to drop B. Reason: since A and B are correlated, and A is also correlated with the target variable, the apparent impact of B on E may simply be due to its correlation with the feature A. Therefore B can be dropped with little impact on the model.
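As a rough illustration, here is a minimal Python sketch of that check on synthetic data; the column names A, B, E and the numbers simply mirror the example above and are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data mirroring the A/B/E example above (names and coefficients are illustrative).
rng = np.random.default_rng(0)
A = rng.normal(size=500)
B = 0.8 * A + 0.6 * rng.normal(size=500)   # B is correlated with A
E = 0.9 * A + 0.4 * rng.normal(size=500)   # target driven mainly by A
df = pd.DataFrame({"A": A, "B": B, "E": E})

corr = df.corr()
print(corr.round(2))

# Among the correlated pair (A, B), keep the one more correlated with the target E.
pair = ["A", "B"]
keep = max(pair, key=lambda c: abs(corr.loc[c, "E"]))
drop = [c for c in pair if c != keep][0]
print(f"keep {keep}, drop {drop}")
```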

But again, compare the results with B kept in the feature set and with B excluded to see the difference. Multicollinearity also causes issues in classification tasks; do check out this blog post.
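If it helps, here is a hedged sketch of that with/without comparison using cross-validation on the same kind of synthetic data; the choice of LinearRegression and the R² scoring are my own illustrative assumptions, not something prescribed in the answer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Same synthetic A/B/E setup as above, regenerated so this snippet runs on its own.
rng = np.random.default_rng(0)
A = rng.normal(size=500)
B = 0.8 * A + 0.6 * rng.normal(size=500)
E = 0.9 * A + 0.4 * rng.normal(size=500)
df = pd.DataFrame({"A": A, "B": B, "E": E})

model = LinearRegression()
r2_with_B = cross_val_score(model, df[["A", "B"]], df["E"], cv=5, scoring="r2").mean()
r2_without_B = cross_val_score(model, df[["A"]], df["E"], cv=5, scoring="r2").mean()
print(f"CV R^2 with B: {r2_with_B:.3f}  without B: {r2_without_B:.3f}")
```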

oseekero

In practice, multicollinearity can be very common if your features really act as correlated causes of your target. If multicollinearity is moderate, or if you are only interested in using your trained ML model to predict out-of-sample data with reasonable goodness-of-fit statistics and are not concerned with understanding the causality between the predictor variables and the target variable, then multicollinearity does not necessarily need to be resolved; even a simple multivariable linear regression model could potentially work well.

In case you really do need to address multicollinearity, the quickest fix, and often an acceptable solution, is to remove one or more of the highly correlated variables. Specifically, you may want to keep the variable that has the strongest relationship with the target per domain knowledge and the least overlap with the other retained variables, as this is intuitively the most informative for prediction.
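One possible reading of that rule is the following greedy sketch (not the answerer's actual procedure): rank the features by absolute correlation with the target and keep each one only if it does not overlap too strongly with anything already retained. The function name, the 0.8 threshold, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def select_features(X: pd.DataFrame, y: pd.Series, max_corr: float = 0.8) -> list[str]:
    """Greedy sketch: rank features by |correlation with the target| and keep a
    feature only if it is not too correlated with anything already retained."""
    target_corr = X.apply(lambda col: col.corr(y)).abs().sort_values(ascending=False)
    feature_corr = X.corr().abs()
    kept: list[str] = []
    for feat in target_corr.index:
        if all(feature_corr.loc[feat, k] < max_corr for k in kept):
            kept.append(feat)
    return kept

# Toy usage with synthetic data (names and threshold are illustrative).
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=["A", "B", "C", "D"])
X["B"] = 0.9 * X["A"] + 0.1 * rng.normal(size=300)   # B nearly duplicates A
y = pd.Series(0.7 * X["A"] + 0.5 * X["C"] + 0.3 * rng.normal(size=300))
print(select_features(X, y))   # expect A to be kept and B dropped
```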

Secondly, you can try to linearly combine the predictor variables in some way, such as adding or subtracting them. By doing so, you create new variables that encompass the information from several correlated variables, and you no longer have a multicollinearity issue. If it is still hard to decide which variables to retain, you can employ dimensionality reduction techniques such as principal component analysis (PCA) or partial least squares (PLS), or regularization techniques such as Lasso or Ridge regression, which can be used to identify the most important variables in a correlated set.
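For instance, a brief sketch of the PCA and Lasso routes with scikit-learn might look like the following; the synthetic data, the 95% variance cutoff, and the cross-validation settings are illustrative assumptions rather than recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with correlated features (purely illustrative).
X, y = make_regression(n_samples=300, n_features=6, n_informative=3,
                       effective_rank=3, noise=5.0, random_state=0)
X_std = StandardScaler().fit_transform(X)

# Option 1: PCA replaces correlated features with uncorrelated components.
pca = PCA(n_components=0.95)           # keep enough components for ~95% of the variance
X_pca = pca.fit_transform(X_std)
print("components kept:", pca.n_components_)

# Option 2: Lasso shrinks coefficients of redundant features toward zero,
# hinting at which variables in a correlated set carry the signal.
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)
print("non-zero coefficients:", np.flatnonzero(lasso.coef_))
```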

cinch