Questions about Standardization

tianqi · March 15, 2022, 7:18pm

Hi! I have a question about data standardization. My dataset consists of continuous variables, discrete variables, and dummy variables. Do I need to standardize the whole dataset, or only the continuous variables? Thank you!

d.snow · March 19, 2022, 6:29pm

The reason we perform normalization is to bring the features into similar ranges. The ranges only need to be similar, not exactly the same, so you can choose which variables to scale.

If you have dummy variables, they take the values 0 or 1, which is exactly the range you get from min-max (0-1) normalization, so you don’t have to worry about them. However, if you perform ordinal encoding, the values can range from, say, 1-100 or even higher, and then you definitely have to normalize them.
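As a minimal sketch of this point, the snippet below applies min-max scaling to a dummy column and to an ordinal-encoded column. The helper name `min_max_scale` and the toy values are made up for illustration; they are not from any particular library.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no range information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy columns
dummy = [0, 1, 1, 0, 1]          # already in the 0-1 range
ordinal = [1, 25, 50, 75, 100]   # ordinal codes on a much wider range

dummy_scaled = min_max_scale(dummy)      # values are unchanged
ordinal_scaled = min_max_scale(ordinal)  # now between 0 and 1

print(dummy_scaled)
print(ordinal_scaled)
```

The dummy column comes out unchanged, while the ordinal column is squeezed into the same 0-1 range.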

If you perform standardization instead, your values will be both negative and positive, but they will typically fall within a few units of zero, which is a range similar to that of binary variables, so again you don’t have to worry.
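To illustrate, here is a small sketch of standardization (subtract the mean, divide by the standard deviation) using only the Python standard library. The `standardize` helper and the `income` values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift a column to mean 0 and scale it to standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant column cannot be scaled; map it to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical continuous feature on a large scale
income = [20_000, 35_000, 50_000, 80_000, 120_000]
income_z = standardize(income)

# The result has both negative and positive values close to zero.
print([round(z, 2) for z in income_z])
```

After scaling, the column sits on roughly the same scale as a 0/1 dummy variable, even though the raw values were in the tens of thousands.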

My general advice when using parametric models (e.g., linear regression or neural networks) is to perform all of the categorical encoding first (one-hot and label encoding) and then apply standardization or min-max normalization to all of the features. (You don’t have to do this, but it might help!)
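The order described above (encode first, then scale every feature) could be sketched roughly as follows. This is only an illustration in plain Python: the `one_hot` and `standardize` helpers and the `age` / `city` columns are invented for the example, and in practice you would likely use a library's encoder and scaler instead.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Return the sorted category names and one 0/1 column per category."""
    categories = sorted(set(values))
    columns = [[1.0 if v == c else 0.0 for v in values] for c in categories]
    return categories, columns


def standardize(column):
    """Shift a column to mean 0 and scale it to standard deviation 1."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


# Hypothetical toy data: one continuous and one categorical feature
age = [23, 35, 47, 59]
city = ["paris", "rome", "paris", "oslo"]

# Step 1: encode the categorical feature.
categories, city_columns = one_hot(city)

# Step 2: collect all features, then scale every column the same way.
features = {"age": age}
for name, column in zip(categories, city_columns):
    features[f"city_{name}"] = column

scaled = {name: standardize(column) for name, column in features.items()}

for name, column in scaled.items():
    print(name, [round(v, 2) for v in column])
```

Scaling the dummy columns as well is optional, as noted above; the sketch does it only to show the "encode, then scale everything" order.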

For tree-based models, I would just perform label encoding and would not perform any normalization.

tianqi · March 21, 2022, 7:39pm

Thank you professor!