Incorporate Domain Knowledge into Your Model with Rule-Based Learning
You’re given a labeled dataset and asked to predict labels for a new one. What would you do?
The first approach you would probably try is to train a machine learning model to find rules for labeling new data.
This is convenient, but it is hard to know why the machine learning model comes up with a particular prediction. You also can’t incorporate your domain knowledge into the model.
Instead of relying on a machine learning model to make predictions, is there a way to set the rules for data labeling based on your knowledge?
That’s when human-learn comes in handy.
human-learn is a Python package for creating rule-based systems that are easy to construct and are compatible with scikit-learn.
To install human-learn, type:
pip install human-learn
In the previous article, I talked about how to create a human learning model by drawing:
In this article, we will learn how to create a model with a simple function.
Feel free to play with and fork the source code of this article here:
To evaluate the performance of a rule-based model, let’s start by predicting a dataset with a machine learning model.
We will use the Occupancy Detection Dataset from the UCI Machine Learning Repository as the example for this tutorial.
Our task is to predict room occupancy based on temperature, humidity, light, and CO2. A room is not occupied if Occupancy=0 and is occupied if Occupancy=1.
After downloading the dataset, unzip it and read the data:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Get train and test data
train = pd.read_csv("occupancy_data/datatraining.txt").drop(columns="date")
test = pd.read_csv("occupancy_data/datatest.txt").drop(columns="date")

# Get X and y
target = "Occupancy"
train_X, train_y = train.drop(columns=target), train[target]
val_X, val_y = test.drop(columns=target), test[target]
Take a look at the first ten records of the train dataset:
train.head(10)
Train scikit-learn’s RandomForestClassifier model on the training dataset and use it to predict the test dataset:
# Train
forest_model = RandomForestClassifier(random_state=1)
forest_model.fit(train_X, train_y)

# Predict
machine_preds = forest_model.predict(val_X)

# Evaluate
print(classification_report(val_y, machine_preds))
The score is pretty good. However, we are unsure how the model comes up with these predictions.
Let’s see if we can label the new data with simple rules.
There are four steps to create rules for labeling data:
- Generate a hypothesis
- Observe the data to validate the hypothesis
- Start with simple rules based on the observations
- Improve the rules
Generate a Hypothesis
Light in a room is a good indicator of whether a room is occupied. Thus, we can assume that the more light there is in a room, the more likely it is to be occupied.
Let’s see if this is true by looking at the data.
Observe the Data
To validate our guess, let’s use a box plot to find the difference in the amount of light between an occupied room (Occupancy=1) and an empty room (Occupancy=0).
import plotly.express as px

feature = "Light"
px.box(data_frame=train, x=target, y=feature)
We can see a significant difference in the median amount of light between an occupied and an empty room.
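The same check can be made numerically. Below is a minimal sketch using a small hypothetical sample in place of the real train data; on the actual dataset, you would call train.groupby(target)[feature].median() instead.

```python
import pandas as pd

# Hypothetical sample standing in for the real train DataFrame
sample = pd.DataFrame(
    {
        "Occupancy": [0, 0, 0, 1, 1, 1],
        "Light": [0.0, 15.0, 30.0, 430.0, 460.0, 500.0],
    }
)

# Median light per occupancy class; a large gap supports the hypothesis
medians = sample.groupby("Occupancy")["Light"].median()
print(medians)
```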
Start with Simple Rules
Now, we will create a rule for whether a room is occupied based on the light in that room. Specifically, if the amount of light is above a certain threshold, Occupancy=1, and Occupancy=0 otherwise.
But what should that threshold be? Let’s start by picking 100 as the threshold and see what we get.
To create a rule-based model with human-learn, we will:
- Write a simple Python function that specifies the rules
- Use FunctionClassifier to turn that function into a scikit-learn model
import numpy as np
from hulearn.classification import FunctionClassifier

def create_rule(data: pd.DataFrame, col: str, threshold: float = 100):
    return np.array(data[col] > threshold).astype(int)

mod = FunctionClassifier(create_rule, col="Light")
Predict the test set and evaluate the predictions:
mod.fit(train_X, train_y)
preds = mod.predict(val_X)
print(classification_report(val_y, preds))
The accuracy is better than what we got earlier using RandomForestClassifier!
Improve the Rules
Let’s see if we can get a better result by experimenting with several thresholds. We will use parallel coordinates to analyze the relationship between a specific value of light and room occupancy.
from hulearn.experimental.interactive import parallel_coordinates

parallel_coordinates(train, label=target, height=200)
From the parallel coordinates, we can see that a room with light above 250 Lux has a high probability of being occupied. The optimal threshold that separates an occupied room from an empty room seems to be somewhere between 250 Lux and 750 Lux.
Let’s find the best threshold in this range using scikit-learn’s GridSearchCV.
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(mod, cv=2, param_grid={"threshold": np.linspace(250, 750, 1000)})
grid.fit(train_X, train_y)
Get the best threshold:
best_threshold = grid.best_params_["threshold"]
best_threshold
> 364.61461461461465
Plot the threshold on the box plot.
Use the model with the best threshold to predict the test set:
human_preds = grid.predict(val_X)
print(classification_report(val_y, human_preds))
The threshold of 365 gives a better result than the threshold of 100.
Using domain knowledge to create rules with a rule-based model is nice, but there are some disadvantages:
- It doesn’t generalize well to unseen data
- It’s difficult to come up with rules for complex data
- There is no feedback loop to improve the model
Thus, combining a rule-based model and an ML model will help data scientists scale and improve the model while still being able to incorporate their domain expertise.
One easy way to combine the two models is to decide whether you want to reduce false negatives or false positives.
Reduce False Negatives
You might want to reduce false negatives in scenarios such as predicting whether a patient has cancer (it’s better to mistakenly tell patients that they have cancer than to fail to detect cancer).
To reduce false negatives, choose positive labels when the two models disagree.
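On hard 0/1 labels, this rule amounts to an element-wise maximum (a logical OR) of the two prediction arrays. A minimal sketch with hypothetical predictions:

```python
import numpy as np

# Hypothetical predictions from the ML model and the rule-based model
ml_preds = np.array([0, 1, 0, 1])
rule_preds = np.array([1, 1, 0, 0])

# Positive wins on disagreement: element-wise OR of the two label arrays
combined = np.maximum(ml_preds, rule_preds)
print(combined)  # [1 1 0 1]
```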
Reduce False Positives
You might want to reduce false positives in scenarios such as recommending videos that might be violent to kids (it’s better to make the mistake of not recommending kid-friendly videos than to recommend adult videos to kids).
To reduce false positives, choose negative labels when the two models disagree.
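Symmetrically, this rule amounts to an element-wise minimum (a logical AND) of the two prediction arrays. A minimal sketch with hypothetical predictions:

```python
import numpy as np

# Hypothetical predictions from the ML model and the rule-based model
ml_preds = np.array([0, 1, 0, 1])
rule_preds = np.array([1, 1, 0, 0])

# Negative wins on disagreement: element-wise AND of the two label arrays
combined = np.minimum(ml_preds, rule_preds)
print(combined)  # [0 1 0 0]
```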
You can also use other, more complex policy layers to decide which prediction to pick.
For a deeper dive into how to combine an ML model and a rule-based model, I recommend checking out this excellent video by Jeremy Jordan.
Congratulations! You have just learned what a rule-based model is and how to combine it with a machine learning model. I hope this article gives you the knowledge needed to develop your own rule-based model.