I’m a little confused about when and how to apply normalization and standardization. First, if I normalize or standardize, do I need to scale all features, or is it OK to scale only the features that have extremely high or low values?

Second, fitting the same model on the original data and on the data after feature scaling gives different scores. So if I want to compare the scores of different models, do I need to fit all of them on the same dataset with the same feature scaling, or can I fit some models, like linear regression and a neural network, on normalized data and others on the original data, and then directly compare their errors?

Third, if I predict the log of insurance charges, can I compute MSE directly from the predicted log charges and the log of the real charges, or do I need to transform the log predictions back before computing MSE and R²?

Thank you very much!

These are all very good questions.

First, if I normalize or standardize, do I need to scale all features, or is it OK to scale only the features that have extremely high or low values?

When using neural networks, logistic regression, linear regression, KNN, and clustering models, it is preferable to scale all of the numeric features. The reason is that gradient descent converges slowly, and distance-based models are dominated by the largest-valued features, when the features are not on similar scales. Tree-based models such as random forests or gradient boosting do not require any form of feature scaling, because their splits depend only on the ordering of values within each feature.
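To make the distance argument concrete, here is a minimal sketch using only the Python standard library. The rows (age, income) are made up for illustration, and `standardize` and `nearest_other` are hypothetical helper names, not part of any library. It shows how the nearest neighbor of the same row can change once both features are put on the same scale.

```python
import math
import statistics

# Hypothetical rows: (age in years, annual income in dollars).
rows = [
    (25, 50_000),
    (60, 50_200),
    (27, 52_000),
    (45, 80_000),
    (35, 30_000),
]


def standardize(data):
    """Rescale each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*data))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in data
    ]


def nearest_other(data, index):
    """Index of the row closest (Euclidean distance) to data[index], excluding itself."""
    others = [i for i in range(len(data)) if i != index]
    return min(others, key=lambda i: math.dist(data[index], data[i]))


# On the raw data, income differences swamp age differences.
raw_neighbor = nearest_other(rows, 0)

# After standardization, both features contribute on a comparable scale.
scaled_neighbor = nearest_other(standardize(rows), 0)

print("nearest to row 0 (raw):   ", rows[raw_neighbor])
print("nearest to row 0 (scaled):", rows[scaled_neighbor])
```

On the raw data the 25-year-old is matched with the 60-year-old because their incomes differ by only 200; after standardization the match is the 27-year-old with a similar income.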

One more point: even though feature scaling is not needed for tree-based models, it doesn’t necessarily hurt, and you will usually get the same score with or without it. Finally, even when you don’t scale the features, it can be very beneficial to transform the target variable (e.g., with a log transform).

So if I want to compare the scores of different models, do I need to fit all of them on the same dataset with the same feature scaling, or can I fit some models, like linear regression and a neural network, on normalized data and others on the original data, and then directly compare their errors?

Yes to the last part of your question. Compare each model using the data representation that works best for that specific model. For example, you would normalize and one-hot encode the data for a linear regression model, and compare it against a random forest model that might only need label encoding and no standardization. Just make sure every model is evaluated on the same held-out rows, with the error computed on the same target scale.
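As a rough sketch of that comparison, assuming scikit-learn and NumPy are available: the data below is synthetic and made up purely for illustration (it is not the insurance dataset from the question), and the column layout is an assumption. Each model gets its own preprocessing, but both are scored on the same test rows with the same metric.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (made up for illustration only).
rng = np.random.default_rng(0)
n = 400
age = rng.uniform(18, 65, n)
bmi = rng.normal(30, 5, n)
region = rng.integers(0, 4, n)  # categorical feature, already label-encoded as 0..3
region_effect = np.array([0.0, 1500.0, -800.0, 2500.0])
charges = 200 * age + 300 * bmi + region_effect[region] + rng.normal(0, 1000, n)

# Columns: 0 = age, 1 = bmi, 2 = region code.
X = np.column_stack([age, bmi, region])
X_train, X_test, y_train, y_test = train_test_split(
    X, charges, test_size=0.25, random_state=0
)

# Linear regression: standardize numeric columns, one-hot encode the category.
linear = make_pipeline(
    ColumnTransformer(
        [
            ("num", StandardScaler(), [0, 1]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),
        ]
    ),
    LinearRegression(),
)

# Random forest: raw numeric columns and the label-encoded category, no scaling.
forest = RandomForestRegressor(n_estimators=200, random_state=0)

# Same test rows and same metric for every model.
scores = {}
for name, model in [("linear", linear), ("forest", forest)]:
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))

print(scores)
```

Putting the preprocessing inside a pipeline means the scaler and encoder are fitted on the training rows only, so the comparison on the test rows stays fair.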

Third, if I predict the log of insurance charges, can I compute MSE directly from the predicted log charges and the log of the real charges, or do I need to transform the log predictions back before computing MSE and R²?

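Either works, as long as you compute the metric the same way for every model you compare; MSE on the log scale and MSE on the original scale are different numbers and cannot be compared with each other. If you reverse the transformation back to the original units, the metrics are usually more interpretable. For example, an MAE (mean absolute error) computed on the original values tells you directly how far off the predictions are in the units of the charges.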
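Here is a small sketch of the two options, using only the standard library. The true charges and the log-scale predictions are made-up numbers standing in for a real model's output, and the natural log is an assumption about which log transform was used.

```python
import math
import statistics

# Hypothetical true charges, in original units.
y_true = [1200.0, 4500.0, 9800.0, 23000.0, 41000.0]

# Hypothetical model output on the log scale (natural log assumed).
log_pred = [math.log(v) for v in (1500.0, 4000.0, 11000.0, 20000.0, 45000.0)]

# Option 1: error measured on the log scale (unitless, relative in nature).
log_true = [math.log(v) for v in y_true]
mse_log = statistics.fmean((p - t) ** 2 for p, t in zip(log_pred, log_true))

# Option 2: invert the transform first, then measure in original units.
y_pred = [math.exp(p) for p in log_pred]
mse_original = statistics.fmean((p - t) ** 2 for p, t in zip(y_pred, y_true))
mae_original = statistics.fmean(abs(p - t) for p, t in zip(y_pred, y_true))

print(f"MSE on log scale:      {mse_log:.4f}")
print(f"MSE in original units: {mse_original:.1f}")
print(f"MAE in original units: {mae_original:.1f}")
```

The log-scale MSE and the original-scale MSE differ by orders of magnitude, which is why the same scale has to be used for every model being compared. If you use scikit-learn, `TransformedTargetRegressor` can apply and invert the target transform for you.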