Subdividing the Subsurface Based on Well Log Measurements
k-Nearest Neighbors (kNN) is a popular non-parametric supervised machine learning algorithm that can be applied to both classification and regression problems. It is easy to implement in Python and easy to understand, which makes it a great algorithm to start with when you begin your machine learning journey.
Within this article, we will cover how the kNN algorithm works and apply it to well log data using Python's Scikit-learn library.
How does the kNN Algorithm Work?
Classifying data is one of the main applications of machine learning. As a result, there are numerous algorithms available, and the kNN algorithm is just one of these.
The idea behind kNN is fairly simple. Points that are near each other are assumed to be similar.
When a new data point is introduced to a trained dataset, the following steps occur:
- Pick a value for k, the number of points that will be used to classify the new data point
- Calculate the distance (Euclidean or Manhattan) between the data point to be classified and the points in the training data
- Identify the k nearest neighbours
- Amongst these k nearest neighbours, count the number of data points in each class
- Using majority voting, assign the new data point to the class that occurs most often
The simple example below shows this process, where we assume k is 3 and the nearest points all belong to a single class.
In the case where the k nearest neighbours are a mixture of classes, we can use majority voting, as illustrated below.
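To make the mechanics concrete, the short sketch below implements a single kNN prediction from scratch using Euclidean distance and majority voting. The toy data, feature values and value of k are purely illustrative; later in the article we rely on Scikit-learn's implementation instead.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, new_point, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote amongst the k nearest neighbours
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# Toy example: two features, two classes
X_toy = np.array([[1.0, 1.2], [0.9, 1.1], [3.0, 3.2], [3.1, 2.9]])
y_toy = np.array(['A', 'A', 'B', 'B'])
print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))  # -> 'A'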
Applications of k-Nearest Neighbors (kNN)
- Recommender systems
- Pattern detection, e.g. fraud detection
- Text mining
- Climate forecasting
- Credit rating analysis
- Medical classification
- Lithology prediction
Advantages of k-Nearest Neighbors (kNN)
- Simple and easy to understand
- Easy to implement in Python using Scikit-learn
- Can be fast to work on small datasets
- No need to tune multiple parameters
- No need to make assumptions about the data
- Can be applied to binary and multi-class problems
Disadvantages of k-Nearest Neighbors (kNN)
- Classification with large datasets can be slow
- Impacted by the curse of dimensionality: as the number of features increases, the algorithm may struggle to make accurate predictions
- Can be sensitive to the scale of the data, i.e. features measured using different units
- Impacted by noise and outliers
- Sensitive to imbalanced datasets
- Missing values need to be handled prior to using the algorithm
Importing the Required Libraries
For this tutorial, we require a number of Python libraries and modules.
First, we will import pandas as pd. This library allows us to load data from csv files and store that data in memory for later use.
Then we have a number of modules from the scikit-learn library:
- KNeighborsClassifier for carrying out the kNN classification
- train_test_split for splitting our data into training and testing datasets
- StandardScaler for standardising the scales of the features
- classification_report, confusion_matrix and accuracy_score for assessing model performance
Finally, to visualise our data, we will be using a mixture of matplotlib and seaborn.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt
Importing the Required Data
The next step is to load our data.
The dataset we are using for this tutorial is a subset of a training dataset used as part of a Machine Learning competition run by Xeek and FORCE 2020 (Bormann et al., 2020). It is released under a NOLD 2.0 licence from the Norwegian Government, details of which can be found here: Norwegian Licence for Open Government Data (NLOD) 2.0.
The full dataset can be accessed at the following link: https://doi.org/10.5281/zenodo.4351155.
To read the data, we can call upon pd.read_csv() and pass in the relative location of the training file.
df = pd.read_csv('Information/Xeek_train_subset_clean.csv')
Once the data has been loaded, we can call upon the describe() method to view the numeric columns within the dataset. This provides us with an overview of the features.
df.describe()
Dealing With Missing Data
Before we proceed with the kNN algorithm, we first need to carry out some data preparation.
As the kNN algorithm does not handle missing values, we need to deal with these first. The simplest way to do that is to carry out listwise deletion, which deletes a row if any of the features within that row have missing values.
It is highly recommended that you carry out a full analysis of your dataset to understand the cause of the missing data and whether it can be repaired.
Even though this method seems a quick solution, it can reduce your dataset significantly.
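Before dropping anything, it can be worth checking how many values are missing in each column and how many rows would be lost. A minimal check, assuming the same df loaded above, might look like this:
# Count missing values per column
print(df.isna().sum())
# Compare the row count before and after listwise deletion
print(f"Rows before: {len(df)}, rows after dropna: {len(df.dropna())}")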
df = df.dropna()
Selecting Training and Test Features
Next, we need to select which features will be used to build the kNN model and which feature will be our target.
For this example, I am using a series of well logging measurements for building the model, and a lithology description as the target feature.
# Select inputs and target
X = df[['RDEP', 'RHOB', 'GR', 'NPHI', 'PEF', 'DTC']]
y = df['LITH']
As with any machine learning model, we need to split our data into a training set, which is used to train/build our model, and a test set, which is used to validate the performance of our model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
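By default, this split is random each time the code is run. If a repeatable split is needed, or if we want the lithology proportions preserved in both sets, the split can be configured as in the sketch below; the random_state value is arbitrary, and the rest of the tutorial continues with the simple split above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,  # fixed seed for a repeatable split
    stratify=y        # keep class proportions similar in both sets
)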
Standardising Feature Values
When working with measurements that have different scales and ranges, it is important to standardise them. This helps to reduce model training times and reduces the impact on models that rely on distance-based calculations.
Standardising the data essentially involves calculating the mean of a feature, subtracting it from each data point and then dividing the result by the feature's standard deviation.
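In other words, each value is replaced by its z-score. As a quick sanity check, assuming numpy is available, the same calculation can be done by hand on a single log, for example GR:
import numpy as np

gr = df['GR'].to_numpy()
gr_standardised = (gr - gr.mean()) / gr.std()  # z = (x - mean) / std
print(gr_standardised.mean(), gr_standardised.std())  # approximately 0 and 1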
Within scikit-learn, we can use the StandardScaler class to transform our data.
First, we use the training data to fit the scaler and transform it in one step with the fit_transform method.
When it comes to the test data, we do not want to fit the StandardScaler to that data, as we have already done so. Instead, we just want to apply the existing fit. This is done using the transform method.
It is important to note that the StandardScaler is applied after the train-test split and is only fitted to the training dataset. Once the scaler has been fitted, it is then applied to the test dataset. This helps prevent the leakage of information from the test dataset into the kNN model.
scaler = StandardScaler()

# Fit the StandardScaler to the training data and transform it
X_train = scaler.fit_transform(X_train)

# Apply the StandardScaler, but do not fit it, to the test data
X_test = scaler.transform(X_test)
Creating and Training the kNN Classifier
When creating the KNeighborsClassifier, we can specify a number of parameters. Full details of these can be found here. Of course, we do not have to supply anything, and the default parameters will be used.
By default, the number of points used to classify new data points is set to 5. This means the class of the 5 closest points will be used to classify that new point.
clf = KNeighborsClassifier()
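If we do want to control the behaviour, the main parameters can be passed in explicitly. The values below are illustrative rather than recommended settings, and the rest of the tutorial continues with the default classifier.
clf = KNeighborsClassifier(
    n_neighbors=5,       # number of neighbours used in the vote
    weights='distance',  # closer neighbours get a larger say in the vote
    metric='euclidean'   # distance measure between points
)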
Once the classifier has been initialised, we next need to train the model using our training data (X_train & y_train). To do this, we call upon clf followed by the fit method.
Within the fit method, we pass in our training data.
clf.fit(X_train, y_train)
Once the model has been trained, we can make predictions on our test data by calling upon the predict method from the classifier.
y_pred = clf.predict(X_test)
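If class probabilities are useful, for example to flag uncertain predictions, the classifier also exposes a predict_proba method. The snippet below is a small optional addition rather than part of the original workflow.
# Probability of each lithology class for the first five test samples
probabilities = clf.predict_proba(X_test[:5])
print(clf.classes_)
print(probabilities)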
Using Model Accuracy
To understand how our model has performed on the test data, we can use a number of metrics and tools.
If we want a quick assessment of how well our model has performed, we can call upon the accuracy score method. This provides us with an indication of how many predictions were correct relative to the total number of predictions.
accuracy_score(y_test, y_pred)
This returns a value of 0.8918532439941167 and tells us that our model has predicted 89.2% of our labels correctly.
Be aware that this value may be misleading, especially if we are dealing with an imbalanced dataset. If there is a class that dominates, then it has a higher chance of being predicted correctly compared to a minority class. The dominant class will inflate the accuracy score and give a false impression that our model has done a good job.
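One way to hedge against this, staying within scikit-learn, is to also report the balanced accuracy, which averages recall across the classes so that minority lithologies count equally.
from sklearn.metrics import balanced_accuracy_score

# Mean of per-class recall; usually less flattering than plain accuracy on imbalanced data
print(balanced_accuracy_score(y_test, y_pred))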
Using the Classification Report
We can take our analysis further and look at the classification report. This provides additional metrics as well as an indication of how well each class was predicted.
The additional metrics, illustrated with a short sketch after this list, are:
- precision: Indicates how many of the values predicted as that class were actually correct. Values are between 0.0 and 1.0, with 1 being the best and 0 being the worst.
- recall: Provides a measure of how well the classifier is able to find all of the positive cases for that class.
- f1-score: The weighted harmonic mean of precision and recall, generating values between 1.0 (good) and 0.0 (poor).
- support: The total number of instances of that class within the dataset.
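As a rough illustration of how precision, recall and the f1-score relate to each other, using hypothetical true-positive, false-positive and false-negative counts for a single class:
# Hypothetical counts for one class, purely for illustration
tp, fp, fn = 90, 10, 30

precision = tp / (tp + fp)                          # 0.9
recall = tp / (tp + fn)                             # 0.75
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean, ~0.82
print(precision, recall, f1)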
To view the classification report, we can call upon the following code and pass y_test and y_pred into the classification_report function.
print(classification_report(y_test, y_pred))
If we look at the results closely, we can see that we are dealing with an imbalanced dataset. The Shale, Sandstone and Limestone classes dominate and, as a result, have relatively high precision and recall scores, whereas Halite, Tuff and Dolomite have relatively low precision and recall.
At this point, I would consider going back to the original dataset and identifying ways to deal with that imbalance. Doing so should greatly improve the model performance.
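One possible approach, sketched below, is to oversample the minority lithologies in the training set. This relies on the imbalanced-learn library, which is not part of this tutorial's imports, so treat it as an optional assumption rather than part of the workflow above.
from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class samples until the training classes are balanced
ros = RandomOverSampler(random_state=42)
X_train_bal, y_train_bal = ros.fit_resample(X_train, y_train)

clf_bal = KNeighborsClassifier()
clf_bal.fit(X_train_bal, y_train_bal)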
Confusion Matrix
We can use another tool to look at how well our model has performed, and that is the confusion matrix. This tool provides a summary of how well our classification model has performed when making predictions for each class.
The generated confusion matrix has two axes. One axis contains the class that the model predicted, and the other axis contains the actual class label.
We can generate two versions of this within Python. The first is a simple printed readout of the confusion matrix, which can be hard to read or present to others. The second is a heatmap version generated using seaborn.
# Easy Printed Confusion Matrix
cf_matrix = confusion_matrix(y_test, y_pred)
print(cf_matrix)

# Graphical version using seaborn and matplotlib
# Prepare the labels for the axes
labels = ['Shale', 'Sandstone', 'Sandstone/Shale',
          'Limestone', 'Tuff', 'Marl', 'Anhydrite',
          'Dolomite', 'Chalk', 'Coal', 'Halite']
labels.sort()

# Set up the figure
fig = plt.figure(figsize=(10,10))
ax = sns.heatmap(cf_matrix, annot=True, cmap='Reds', fmt='.0f',
                 xticklabels=labels,
                 yticklabels=labels)
ax.set_title('Seaborn Confusion Matrix with labels\n\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');
When we run the above code, we get the following printed table and plot.
The resulting confusion matrix provides us with an indication of which classes the model predicted correctly and incorrectly, and we can start to identify any patterns where the model may be mispredicting lithologies.
For example, if we look at the Limestone class, we can see that 2,613 points were predicted correctly; however, 185 were predicted as Chalk and 135 as Marl. Both of these lithologies have a calcitic nature and share similar properties to limestone. Therefore, we could go back and look at our features to determine whether other features are required or whether some need to be removed.
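To make those patterns easier to spot, one option is to normalise each row of the matrix so it shows the fraction of each true class assigned to every predicted class. A minimal sketch, reusing the cf_matrix, labels, seaborn and matplotlib objects from above:
# Each row sums to 1: the share of a true class assigned to each predicted class
cf_normalised = cf_matrix / cf_matrix.sum(axis=1, keepdims=True)

fig = plt.figure(figsize=(10, 10))
ax = sns.heatmap(cf_normalised, annot=True, cmap='Reds', fmt='.2f',
                 xticklabels=labels, yticklabels=labels)
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values')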
Summary
The k-Nearest Neighbors algorithm is a powerful, yet easy-to-understand, supervised machine learning algorithm that can be applied to classification-based problems, especially within the geoscience domain.
This tutorial has shown how we can take a series of pre-classified well log measurements and make predictions about new data. However, care should be taken when preprocessing the data and when dealing with imbalanced datasets, which are common in subsurface applications.