Question about Highly Correlated Features

I have a question about how to deal with highly correlated features in machine learning. In the EDA section we plotted the correlation matrix of all the features and found that several of them are very highly correlated, with coefficients around 0.8 or 0.9. What is a reasonable way to handle this during preprocessing? Should we use PCA to reduce the dimensionality, or simply remove the similar features? Or should we leave them alone and let the model handle it? But wouldn't similar features dilute the feature importance? Thanks.

Hi Muyi,

Regarding highly correlated features, it is generally recommended to avoid keeping them in your model. Highly correlated features bring little or no additional information, yet they increase the complexity of the model and, with it, the risk of errors.

The easiest way to deal with highly correlated features is simply to delete one feature from each highly correlated pair.
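As a minimal sketch of that approach (the data, column names, and 0.9 threshold here are just illustrative assumptions), you can scan the upper triangle of the correlation matrix and drop one column from every pair above the threshold:

```python
import numpy as np
import pandas as pd

# Toy data: x2 is a near-duplicate of x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=200),  # highly correlated with x1
    "x3": rng.normal(size=200),
})

def drop_correlated(df, threshold=0.9):
    """Drop one feature from every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

reduced = drop_correlated(df)
print(reduced.columns.tolist())  # x2 is dropped, x1 and x3 remain
```

Note that this keeps whichever column of a pair appears first, which is arbitrary; in practice you might prefer to keep the feature that is easier to interpret or collect.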

You can also apply dimensionality reduction, which can be done via PCA.
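A quick sketch of the PCA route with scikit-learn (the toy data is the same illustrative setup as above, and the 95% variance target is an assumption, not a rule): correlated features share variance, so PCA collapses them into fewer components.

```python
import numpy as np
from sklearn.decomposition import PCA

# Three features, two of which are nearly identical.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([
    x1,
    x1 + rng.normal(scale=0.05, size=200),  # near-duplicate of x1
    rng.normal(size=200),
])

# Keep enough components to explain 95% of the variance.
# (In practice, standardize the features first if their scales differ.)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # the two correlated columns collapse into one component
```

The trade-off is interpretability: the components are linear combinations of the original features, so feature importances no longer map back to individual inputs.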

Ultimately, it is up to you which method you use to handle the highly correlated features. You could try both and see which one works best for your model.


Understood. Thanks for your suggestions!


That is a good answer. You can also use non-linear dimensionality reduction methods (such as autoencoders) to preserve more of the information, given that we will use non-linear models to perform the forecasting. There are also models where multicollinearity doesn't hurt predictions much, such as all decision-tree-based models.
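A small sketch illustrating both points raised in this thread (the data and target here are made-up assumptions): a tree ensemble still predicts well with a duplicated feature, but the importance attributed to the true driver gets split between the two copies, which is exactly the "dilution" the original question worries about.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)  # near-duplicate of x1
x3 = rng.normal(size=500)                   # irrelevant feature
y = 3 * x1 + rng.normal(scale=0.1, size=500)  # only x1 actually drives y

X = np.column_stack([x1, x2, x3])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Predictions are largely unaffected by the collinearity...
print(round(model.score(X, y), 3))

# ...but the importance of the true signal is shared between x1 and x2.
print(model.feature_importances_)
```

So tree-based models tolerate multicollinearity for prediction, but if you need trustworthy feature importances you should still deduplicate first.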