Standardization and selection of hyperparameters

Hi class, I have two questions about standardization and the selection of hyperparameters.

First, according to previous posts, we are supposed to do the scaling after the train/test split, fitting the scaler on the training set and applying it to both the training and test sets. My question is: should I build separate scalers for X and y, can they be fit and transformed together, or should y even be standardized at all? And if only the X variables are scaled, is the computation of the final test errors affected?

Second, I was trying to tune several parameters (for GBR) with a grid search, but the runtime is still long even after I reduced the number of parameters and values. Are there widely accepted "most important" parameters, or does it depend, so that we can just select the ones that seem significant to us?

Thank you in advance!

Hi, thanks for your questions. X and y should not be fit and transformed together. For X, we do this:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on the training set only
X_test_scaled = scaler.transform(X_test)        # reuse those same parameters on the test set

The scaling parameters that we learn from the training set are then used to transform the test set data as well.
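If you also choose to standardize y, the usual approach is a separate scaler fit on y_train only, and you then inverse-transform the predictions before computing test errors so that the errors are in the original units of y. Here is a minimal sketch of that workflow, assuming X_train, X_test, y_train and y_test are numpy arrays from your own train/test split and that a GBR is the model (the variable names are just illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Separate scalers for X and y (StandardScaler expects 2D input for y)
x_scaler = StandardScaler()
y_scaler = StandardScaler()

X_train_scaled = x_scaler.fit_transform(X_train)
X_test_scaled = x_scaler.transform(X_test)
y_train_scaled = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()

model = GradientBoostingRegressor(random_state=0)
model.fit(X_train_scaled, y_train_scaled)

# Predictions come out in standardized units, so undo the y scaling
# before comparing against the untouched y_test.
y_pred = y_scaler.inverse_transform(model.predict(X_test_scaled).reshape(-1, 1)).ravel()
test_mse = mean_squared_error(y_test, y_pred)

If you scale only X and leave y alone, the test errors are already in the original units and no inverse transform is needed.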

Yes, grid search over a GBR can get expensive. There are definitely some parameters that matter more than others; the list below is roughly in order of importance, and a sketch of a reduced grid search over them follows the list.

Parameters of interest (just for gradient boosting models):

  • The learning rate, which shrinks the contribution of each successive tree.
    • A lower learning rate generally gives a better score, provided you add enough estimators to compensate.
  • The number of estimators, i.e. the number of boosting stages (trees) added sequentially.
    • Too many estimators slows down the model.
    • It can also lead to a slow decline in out-of-sample performance.
    • As such, you can’t keep increasing it indefinitely.
  • The maximum depth parameter indicates how deep each tree can be built.
    • AdaBoost models typically use stumps, which have a depth of 1.
    • Don’t make this number too large, as deep trees overfit.
  • The maximum number of features to consider (sampled at random) when searching for the best split.
  • The minimum number of samples required to split an internal node.
    • Increasing this parameter makes the tree more constrained.
    • This constraint is generally good because it reduces variance.
  • The minimum weighted fraction of samples required to be at a leaf node.
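To keep the runtime manageable, you can restrict the grid to the top two or three parameters above (learning rate, number of estimators, maximum depth) and parallelize the search. Below is a minimal sketch using scikit-learn's GridSearchCV with a GradientBoostingRegressor; the grid values are only illustrative, and X_train_scaled / y_train are assumed to come from your own split and scaling step:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Small grid over the most influential parameters only
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 500],
    "max_depth": [2, 3, 4],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,                              # 5-fold cross-validation
    scoring="neg_mean_squared_error",  # pick the metric that matters for your task
    n_jobs=-1,                         # use all available cores
)
search.fit(X_train_scaled, y_train)

print(search.best_params_)
print(-search.best_score_)  # best cross-validated MSE

If even this reduced grid is too slow, RandomizedSearchCV with a fixed n_iter budget is a common way to cap the runtime while still exploring the same parameter ranges.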