by Prof Tim Dodwell
Introduction to Machine Learning
5. K-Nearest Neighbour Walkthrough
In this explainer we are going to look at a first example of using k Nearest Neighbour. This will give you a basic idea of how kNN is used and also a simple way to determine a good value of k.
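As a quick warm-up, here is a minimal sketch of the idea behind kNN regression, using a tiny made-up dataset (the numbers are purely illustrative): to predict a new point we find its k nearest training points and average their target values.
import numpy as np

# Hypothetical toy data, just to show the mechanics of kNN regression
X_train_demo = np.array([[0.1], [0.2], [0.4], [0.8], [0.9]])
y_train_demo = np.array([1.0, 1.2, 2.0, 3.8, 4.1])

x_new = np.array([0.35])
k = 2

# Distance from the new point to every training point
distances = np.linalg.norm(X_train_demo - x_new, axis=1)

# Indices of the k closest training points
nearest = np.argsort(distances)[:k]

# kNN regression: predict the mean of the k nearest targets
prediction = y_train_demo[nearest].mean()
print(prediction)  # average of 1.2 and 2.0 -> 1.6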
Add Libraries
As normal, we start by importing some standard libraries for the various functions we will use throughout.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
A dataset of sea snails!
Ok, so important stuff: can we predict the age of a sea snail (equivalent to its number of rings) from various simple measurements?
Let us first download the dataset and get moving.
url = (
"https://archive.ics.uci.edu/ml/machine-learning-databases"
"/abalone/abalone.data"
)
abalone = pd.read_csv(url, header = None)
abalone.columns = [
'Sex',
'Length',
'Diameter',
'Height',
'Whole weight',
'Shucked weight',
'Viscera weight',
'Shell weight',
'Rings',
]
abalone = abalone.drop("Sex", axis = 1)
We dropped "Sex" because it is a categorical measurement, whereas all the other columns are simple numbers. Dropping it makes life easier!
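As an aside, a possible alternative to dropping "Sex" would be to one-hot encode it into numeric indicator columns, which kNN can then treat like any other feature. Here is a quick sketch using a hypothetical mini DataFrame (we stick with dropping the column for the rest of this walkthrough):
# Hypothetical example of one-hot encoding a categorical column with pandas
demo = pd.DataFrame({"Sex": ["M", "F", "I"], "Length": [0.45, 0.53, 0.33]})
print(pd.get_dummies(demo, columns=["Sex"]))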
Ok, so let's look at our data
abalone.head()
abalone.describe().T
X = abalone.drop("Rings", axis = 1)
y = abalone["Rings"]
We first want to rescale our data, which we do using a built-in function in sklearn. This standardises each variable to zero mean and unit variance. Rescaling is vital in many ML algorithms; without it, variables measured on larger scales dominate the distance calculations and the inputs are not handled fairly.
from sklearn.preprocessing import scale
X_scaled = scale(X)
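If you would rather squash every feature into the range 0 to 1 instead, sklearn's MinMaxScaler does exactly that. A quick sketch, not needed for the rest of this walkthrough:
# Optional alternative: rescale each feature to the [0, 1] range instead
from sklearn.preprocessing import MinMaxScaler

X_minmax = MinMaxScaler().fit_transform(X)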
Validation : Train / Test Split
So we are going to want to train our model, and then test it against held-out data. Here we again use sklearn's built-in functions, holding out 20% of the data for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
Fitting a Model
Now we are in shape to fit a model. First we do this for a single given value of k, namely k = 10.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
k = 10
knn_model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
y_pred = knn_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)
We see here the root mean squared error, which tells us how many rings (equivalently, years) our predictions are out by on average.
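To give that error some context, a quick sanity check (not part of the original walkthrough) is to compare against a naive baseline that always predicts the mean number of rings in the training set:
# Naive baseline: always predict the training-set mean number of rings
baseline_pred = np.full(len(y_test), y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
print(baseline_rmse)  # compare with the kNN rmse above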
So let's look at a validation plot over the whole of the testing set and see how we did.
plt.figure(figsize=(6,4))
plt.scatter(y_pred, y_test, c='cornflowerblue', alpha = 0.1)
perfect_line = np.linspace(3,23,2)
plt.plot(perfect_line, perfect_line, '-k', alpha=0.6)
plt.title('Model Validation for k = 10')
plt.xlim([2,19])
plt.xlabel("Predicted")
plt.ylabel("True")
You will note some banding in this plot; this is because the number of rings is an integer rather than a continuous value. Hence in this case a kNN classifier might yield better results, but we won't worry about that here as we are primarily looking at the general principles.
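If you did want to follow that thought up, a minimal sketch of the classification version (reusing the same train/test split, with each ring count treated as a class label) might look like this:
# Sketch: treat each ring count as a class label and use a kNN classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn_clf = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
print(accuracy_score(y_test, knn_clf.predict(X_test)))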
Choosing k
So now we have built the model for a single value of k, how do we explore the optimal value of k?
We now build models for all values of k from 1 all the way up to 124.
k = np.arange(1,125, 1)
mse = []
for n_neigh in k:
    knn_model = KNeighborsRegressor(n_neighbors=n_neigh).fit(X_train, y_train)
    y_pred = knn_model.predict(X_test)
    mse.append(mean_squared_error(y_test, y_pred))
optimal_k = k[np.argmin(mse)]
print(optimal_k)
The optimal k in this study is the value printed above.
Let us also plot the error for all the other values of k.
plt.figure(figsize=(6,4))
plt.plot(k, mse, '-b', alpha=0.6, label='Test MSE')
plt.scatter(optimal_k, np.min(mse), c='g', label='Optimal k')
plt.xlabel('Number of Neighbours')
plt.ylabel('Mean Square Error')
plt.legend()
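One caution about the approach above: we have picked k by minimising the error on the test set, so the test set no longer gives a truly independent estimate of performance. A more careful alternative, sketched here under the same setup, is to choose k by cross-validation on the training data only, for example with sklearn's GridSearchCV:
# Sketch: choose k by 5-fold cross-validation on the training data only
from sklearn.model_selection import GridSearchCV

param_grid = {"n_neighbors": np.arange(1, 125)}
search = GridSearchCV(KNeighborsRegressor(), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)
print(search.best_params_)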