Prof Tim Dodwell

General Linear Models

10. Logistic Regression in the Wild

True or False - Logistic Regression in the Wild


In this walkthrough we are going to look at logistic regression, an example of a generalised linear model, which enables us to achieve classification in a regression framework.
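
To make the idea of "classification in a regression framework" concrete, here is a minimal sketch of the logistic (sigmoid) link, which turns a linear predictor into a probability. The weights `w`, intercept `b` and input `x` are made-up values for illustration only; they are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and input, chosen only for illustration
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([1.0, 2.0])

z = w @ x + b          # linear predictor, exactly as in linear regression
p = sigmoid(z)         # interpreted as the probability of the positive class
label = int(p >= 0.5)  # classify by thresholding that probability

print(f"z = {z:.2f}, p = {p:.3f}, label = {label}")
```

The regression part is the linear predictor `z`; the classification part is nothing more than a threshold on `p`.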

Import Libraries

import numpy as np
import pandas as pd

Pima Indians Diabetes Dataset

Here we load a dataset from Kaggle.

This data set contains various measurements for a group of women, together with the outcome of whether or not they had diabetes. We are going to build a classification model using logistic regression to try to predict that outcome.

df = pd.read_csv('diabetes.csv', sep=',')
df.head()
academy.digilab.co.uk

We can see there are 9 columns and 768 entries in total. We can confirm this by calling the following:

df.info()
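
Alongside `df.info()`, a few one-liners give the same kind of overview in a form you can use later in code. The sketch below runs on a tiny made-up frame called `toy` (a hypothetical stand-in, not the real data), so it works without `diabetes.csv`; the same calls apply to `df`.

```python
import pandas as pd

# Tiny hypothetical frame standing in for the real dataset
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

n_rows, n_cols = toy.shape                    # (number of rows, number of columns)
class_counts = toy["Outcome"].value_counts()  # how many 0s and 1s there are
missing = toy.isna().sum()                    # missing values per column

print(n_rows, n_cols)
print(class_counts)
print(missing)
```

Checking the class counts early is useful for classification, because a heavily imbalanced outcome changes how accuracy should be read.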

OK, let us pull out the feature (or input) columns we are interested in, and then assign y to the outcome.

feature_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness','Insulin','BMI','DiabetesPedigreeFunction', 'Age']
X = df[feature_cols]
y = df.Outcome

As normal, we will split our data into training data and testing data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=16)
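
As a quick illustration of what `test_size=0.25` and `random_state=16` do, the sketch below applies the same call to placeholder arrays with 768 rows. `X_demo` and `y_demo` are hypothetical stand-ins, not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (768)
X_demo = np.arange(768 * 2).reshape(768, 2)
y_demo = np.arange(768) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=16
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=16
)
print(np.array_equal(X_te, X_te2))
```

With 768 rows, a quarter held out for testing means 576 training rows and 192 test rows. Fixing `random_state` makes the split repeatable between runs.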

Now we are in a good position to fit our logistic model, and then use the trained model to make predictions on our testing data.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(solver='liblinear')

logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
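
It can help to see what `predict` is doing under the hood. The sketch below fits the same kind of model on synthetic data from `make_classification` (an assumption made purely so the example runs on its own) and checks that the predicted labels match a 0.5 threshold applied to the class-1 probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data, used only so the example is self-contained
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression(solver='liblinear')
model.fit(X_demo, y_demo)

proba = model.predict_proba(X_demo)[:, 1]  # P(class 1) for each row
manual = (proba > 0.5).astype(int)         # threshold at 0.5 by hand
labels = model.predict(X_demo)             # what predict returns

print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("agreement:", np.mean(manual == labels))
```

The fitted `coef_` and `intercept_` define the linear predictor, so in a binary problem `predict` amounts to asking which side of the 0.5 probability line each sample falls on.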

So how did we do? Let us plot a confusion matrix to see how the classification did!

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics

cnf_matrix = metrics.confusion_matrix(y_test, y_pred)

class_names = [0, 1]  # names of the classes
fig, ax = plt.subplots()
# create heatmap, labelling rows and columns with the class names
sns.heatmap(pd.DataFrame(cnf_matrix, index=class_names, columns=class_names), annot=True, cmap="YlGnBu", fmt='g', ax=ax)
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
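
The four cells of the confusion matrix can also be turned into summary numbers. Here is a minimal sketch using made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, not results from the diabetes model) showing how accuracy, precision and recall follow from the cells.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels, for illustration only
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are actual classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # fraction of all predictions that are correct
precision = tp / (tp + fp)       # of the predicted positives, how many are real
recall = tp / (tp + fn)          # of the real positives, how many were found

print(cm)
print(accuracy, precision, recall)
```

The same values are available directly from `metrics.accuracy_score`, `metrics.precision_score` and `metrics.recall_score`; computing them by hand simply shows where they come from.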

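Next we can look at the ROC curve as well, which shows how the true positive rate and false positive rate trade off as the classification threshold varies.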

y_pred_proba = logreg.predict_proba(X_test)[:, 1]

fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)

auc = np.around(metrics.roc_auc_score(y_test, y_pred_proba), decimals = 2)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.plot([0.0, 1.0], [0.0, 1.0], '-k', alpha=0.4, label="Random Classifier")
plt.xlim([-0.01, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc=4)
plt.show()
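
One way to read the AUC value in the legend is as the fraction of (positive, negative) pairs in which the positive sample gets the higher score. The sketch below checks this on four made-up scores; `y_true_demo` and `scores_demo` are hypothetical values, not model output.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predicted probabilities, for illustration only
y_true_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

auc_sklearn = metrics.roc_auc_score(y_true_demo, scores_demo)

# Pairwise interpretation: compare every positive score with every negative score
pos = scores_demo[y_true_demo == 1]
neg = scores_demo[y_true_demo == 0]
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
auc_pairwise = (wins + 0.5 * ties) / (len(pos) * len(neg))

print(auc_sklearn, auc_pairwise)
```

Here three of the four pairs are ranked correctly, so both calculations give 0.75. A random classifier would sit at 0.5, which is the diagonal line in the plot.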