by Prof Tim Dodwell
Generalised Linear Models
10. Logistic Regression in the Wild
True or False - Logistic Regression in the Wild
In this walkthrough we are going to look at logistic regression, an example of a generalised linear model which allows us to perform classification within a regression framework.
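To make that idea concrete: logistic regression fits a linear model to the log-odds of the positive class and passes it through the sigmoid (logistic) function, which squashes any real number into a probability between 0 and 1. The short sketch below uses made-up coefficient values, purely to illustrate the mechanism.

import numpy as np

def sigmoid(z):
    # logistic function: maps any real number to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical coefficients for a one-feature model: log-odds = b0 + b1 * x
b0, b1 = -4.0, 0.05
x = np.array([40.0, 80.0, 120.0, 160.0])

p = sigmoid(b0 + b1 * x)          # predicted probability of the positive class
labels = (p >= 0.5).astype(int)   # threshold at 0.5 to get a class label
print(p, labels)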
Import Libraries
import numpy as np
import pandas as pd
Pima Indians Diabetes Dataset
Here we load a dataset from Kaggle.
This dataset contains various diagnostic measurements for a group of women, together with the outcome of whether or not each had diabetes. We are going to build a classification model using logistic regression to try and predict this outcome.
df = pd.read_csv('diabetes.csv', sep=',')
df.head()
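It is also worth checking how balanced the two outcome classes are, since a skewed split affects how we should read accuracy later on. A quick check on the Outcome column:

# count how many rows fall into each class (0 = no diabetes, 1 = diabetes)
print(df['Outcome'].value_counts())

# the same information as a proportion of the dataset
print(df['Outcome'].value_counts(normalize=True))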
We can see there are 9 columns and 768 entries in total by calling the following:
df.info()
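A well-known quirk of this particular dataset is that some missing measurements are recorded as zeros (a Glucose or BMI of 0 is not physically plausible). We won't clean these up here, but a quick way to see how widespread they are is:

# count zero entries in columns where zero is not a plausible measurement
suspect_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print((df[suspect_cols] == 0).sum())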
OK, let us pull out the feature (or input) columns we are interested in, and then assign y
to the outcome!
feature_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness','Insulin','BMI','DiabetesPedigreeFunction', 'Age']
X = df[feature_cols]
y = df.Outcome
As usual, we will split our data into training data and testing data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=16)
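A quick sanity check that the split has the sizes we expect (roughly 75% / 25% of the 768 rows):

# confirm the shapes of the training and testing sets
print(X_train.shape, X_test.shape)   # expect roughly (576, 8) and (192, 8)
print(y_train.shape, y_test.shape)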
Now we are in a good position to fit our logistic regression model, and then use the trained model to make predictions on our testing data.
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
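Because logistic regression is still a (generalised) linear model, the fitted coefficients are easy to inspect: each one is the change in the log-odds of diabetes for a unit change in that feature. A quick look:

# pair each feature with its fitted coefficient (on the log-odds scale)
coef_table = pd.Series(logreg.coef_[0], index=feature_cols)
print(coef_table)
print('intercept:', logreg.intercept_[0])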
So how did we do? Let us plot a confusion matrix to see how the classifier did!
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
class_names = [0, 1]  # names of the classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu", fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
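The confusion matrix can also be summarised with a few headline numbers; accuracy, precision and recall are all available from sklearn.metrics:

# headline classification metrics computed from the same predictions
print("Accuracy: ", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:   ", metrics.recall_score(y_test, y_pred))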
Next we can look at a ROC curve as well.
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = np.around(metrics.roc_auc_score(y_test, y_pred_proba), decimals=2)
plt.plot(fpr, tpr, label="Logistic Regression, AUC = " + str(auc))
plt.plot([0.0, 1.0], [0.0, 1.0], '-k', alpha=0.4, label="Random Classifier")
plt.xlim([-0.01, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc=4)
plt.show()
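Finally, once we are happy with the model, we can use it to score a new patient. The measurement values below are made up purely for illustration; the columns must appear in the same order as feature_cols.

# a single hypothetical patient, with made-up measurement values
new_patient = pd.DataFrame([[2, 130, 70, 25, 90, 30.5, 0.4, 45]],
                           columns=feature_cols)

print(logreg.predict(new_patient))        # predicted class (0 or 1)
print(logreg.predict_proba(new_patient))  # probabilities for each class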