Grid search over any machine learning pipeline step using an EstimatorSwitch
A very common step when building a machine learning model is to grid search over a classifier's parameters on the train set, using cross-validation, to find the most optimal parameters. What is less well known is that you can also grid search over virtually any pipeline step, such as feature engineering steps. For example, which imputation strategy works best for numerical values: mean, median or arbitrary? Which categorical encoding method should you use: one-hot encoding, or maybe ordinal?
In this article, I'll guide you through the steps needed to answer such questions in your own machine learning projects using grid searches.
To install all the required Python packages for this article:
pip install extra-datascience-tools feature-engine
The dataset
Let's consider the following very simple public domain dataset I created, which has two columns: last_grade and course_passed. The last grade column contains the grade the student achieved on their last exam, and the passed course column is a boolean column that is True if the student passed the course and False if the student failed it. Can we build a model that predicts whether a student passed the course based on their last grade?
Let us first explore the dataset:
import pandas as pd

df = pd.read_csv('last_grades.csv')
df.isna().sum()
OUTPUT
last_grade 125
course_passed 0
dtype: int64
Our target variable course_passed has no nan values, so there is no need to drop any rows here.
Of course, to prevent any data leakage we should split our dataset into a train and a test set before continuing.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
df[['last_grade']],
df['course_passed'],
random_state=42)
Because most machine learning models don't allow for nan values, we have to consider different imputation strategies. Normally, of course, you would start with EDA (exploratory data analysis) to determine whether the nan values are MAR (Missing at Random), MCAR (Missing Completely at Random) or MNAR (Missing Not at Random). A good article that explains the differences between these can be found here:
Instead of analyzing why the last grade is missing for some students, we are simply going to grid search over different imputation strategies to illustrate how to grid search over any pipeline step, such as this feature engineering step.
Let's explore the distribution of the independent variable last_grade:
import seaborn as sns

sns.histplot(data=X_train, x='last_grade')
It looks like the last grades are normally distributed with a mean of ~6.5 and values between ~3 and ~9.5.
Let's also look at the distribution of the target variable to determine which scoring metric to use:
y_train.value_counts()
OUTPUT
True 431
False 412
Name: course_passed, dtype: int64
The target variable is roughly equally divided, which means we can use scikit-learn's default scorer for classification tasks, which is the accuracy score. In the case of an unequally divided target variable the accuracy score is not appropriate; use e.g. F1 instead.
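As a minimal sketch of that last point (the pipeline used in this article is only built in the next section, so a bare DecisionTreeClassifier stands in for it here), this is how an explicit scorer such as F1 could be passed to GridSearchCV:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# The default scorer for classifiers is accuracy; for an imbalanced target
# an explicit scorer such as F1 can be passed instead
gridsearch = GridSearchCV(
    DecisionTreeClassifier(),
    param_grid={"max_depth": [None, 2, 5]},
    scoring="f1",
)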
Grid searching
Next, we're going to set up the model and the grid search and run it by only optimizing the classifier's parameters, which is how I see most data scientists use a grid search. We'll use feature-engine's MeanMedianImputer for now to impute the mean, and scikit-learn's DecisionTreeClassifier to predict the target variable.
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from feature_engine.imputation import MeanMedianImputer
model = Pipeline(
[
("meanmedianimputer", MeanMedianImputer(imputation_method="mean")),
("tree", DecisionTreeClassifier())
]
)
param_grid = [
{"tree__max_depth": [None, 2, 5]}
]
gridsearch = GridSearchCV(model, param_grid=param_grid)
gridsearch.fit(X_train, y_train)
pd.DataFrame(gridsearch.cv_results_).loc[:,
['rank_test_score',
'mean_test_score',
'param_tree__max_depth']
].sort_values('rank_test_score')
As we can see from the table above, using GridSearchCV we found that we can improve the accuracy of the model by ~0.55 just by changing the max_depth of the DecisionTreeClassifier from its default value of None to 5. This clearly illustrates the positive impact grid searching can have.
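If you only need the winning configuration rather than the full results table, GridSearchCV also exposes it directly; a small illustrative snippet:
# The best parameter combination and its mean cross-validated accuracy
print(gridsearch.best_params_)
print(gridsearch.best_score_)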
However, we don't know whether imputing the missing last_grades with the mean is actually the best imputation strategy. What we can do is grid search over three different imputation strategies using extra-datascience-tools' EstimatorSwitch:
- Mean imputation
- Median imputation
- Arbitrary number imputation (by default 999 for feature-engine's ArbitraryNumberImputer)
from feature_engine.imputation import (
ArbitraryNumberImputer,
MeanMedianImputer,
)
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from extra_ds_tools.ml.sklearn.meta_estimators import EstimatorSwitch

# create a pipeline with two imputation strategies
model = Pipeline(
[
("meanmedianimputer", EstimatorSwitch(
MeanMedianImputer()
)),
("arbitraryimputer", EstimatorSwitch(
ArbitraryNumberImputer()
)),
("tree", DecisionTreeClassifier())
]
)
# specify the parameter grid for the classifier
classifier_param_grid = [{"tree__max_depth": [None, 2, 5]}]
# specify the parameter grid for feature engineering
feature_param_grid = [
{"meanmedianimputer__apply": [True],
"meanmedianimputer__estimator__imputation_method": ["mean", "median"],
"arbitraryimputer__apply": [False],
},
{"meanmedianimputer__apply": [False],
"arbitraryimputer__apply": [True],
},
]
# join the parameter grids together
model_param_grid = [
{
**classifier_params,
**feature_params
}
for feature_params in feature_param_grid
for classifier_params in classifier_param_grid
]
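Because classifier_param_grid contains only one dictionary, the join above simply merges the classifier parameters into each of the two feature engineering dictionaries. A quick sanity check is to print the joined grid:
from pprint import pprint

# model_param_grid now holds two dictionaries, each combining the
# tree__max_depth options with one imputer's switch settings
pprint(model_param_grid, sort_dicts=False)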
Some important things to notice here:
- We enclosed both imputers in the Pipeline within extra-datascience-tools' EstimatorSwitch, because we don't want to use both imputers at the same time. This is because after the first imputer has transformed X, there will be no nan values left for the second imputer to transform.
- We split the parameter grid into a classifier parameter grid and a feature engineering parameter grid. At the bottom of the code, we join these two grids together so that every feature engineering grid is combined with every classifier grid, because we want to try a tree__max_depth of None, 2 and 5 for both the ArbitraryNumberImputer and the MeanMedianImputer.
- We use a list of dictionaries instead of a single dictionary in the feature parameter grid, so that we prevent the MeanMedianImputer and the ArbitraryNumberImputer from being applied at the same time. Using the apply parameter of EstimatorSwitch we can simply turn one of the two imputers on or off, as illustrated in the sketch after this list. Of course, you could also run the code twice, once with the first imputer commented out and once with the second imputer commented out. However, this would lead to errors in our parameter grid, so we would need to adjust that one as well, and the results of the different imputation strategies would not be available in the same grid search cv results, which makes them much harder to compare.
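Since the grid addresses the switches through the regular scikit-learn parameter interface (meanmedianimputer__apply and arbitraryimputer__apply), you can also toggle them by hand on the pipeline itself, for instance to fit a single configuration outside of the grid search; a minimal sketch:
# Turn the mean/median imputer off and the arbitrary number imputer on,
# then fit this single configuration without a grid search
model.set_params(meanmedianimputer__apply=False, arbitraryimputer__apply=True)
model.fit(X_train, y_train)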
Let us look at the new results:
gridsearch = GridSearchCV(model, param_grid=model_param_grid)
gridsearch.fit(X_train, y_train)

pd.DataFrame(gridsearch.cv_results_).loc[:,
['rank_test_score',
'mean_test_score',
'param_tree__max_depth',
'param_meanmedianimputer__estimator__imputation_method']
].sort_values('rank_test_score')
We now see a new best model, which is the decision tree with a max_depth of 2, using the ArbitraryNumberImputer. We improved the accuracy by 1.4% by implementing a different imputation strategy! And as a welcome bonus, our tree depth has shrunk to 2, which makes the model easier to interpret.
Of course, grid searching can already take quite some time, and by grid searching not only over the classifier but also over other pipeline steps, the grid search can take even longer. There are a few methods to keep the extra time it takes to a minimum:
- First grid search over the classifier's parameters and then over other steps such as feature engineering steps, or vice versa, depending on the situation.
- Use extra-datascience-tools' filter_tried_params to prevent duplicate parameter settings of a grid search.
- Use scikit-learn's HalvingGridSearchCV or HalvingRandomSearchCV instead of a GridSearchCV (still in the experimental phase); see the sketch after this list.
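For that last option, a minimal sketch of how the experimental halving search could replace GridSearchCV for the same model and parameter grid; note the explicit experimental import it still requires:
# HalvingGridSearchCV is still experimental and needs this enabling import
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV

halving_search = HalvingGridSearchCV(model, param_grid=model_param_grid)
halving_search.fit(X_train, y_train)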
Besides using grid searching to optimize a classifier such as a decision tree, we saw that you can actually optimize virtually any step in a machine learning pipeline using extra-datascience-tools' EstimatorSwitch, by e.g. grid searching over the imputation strategy. Some more examples of pipeline steps that are worth grid searching over, besides the imputation strategy and the classifier itself, are: