by Dr Andy Corbett


13. Support Vector Machines in the Wild

📂 Resources

Download the resources for this lesson here.

To see how this method works on real data, let's import the breast cancer dataset from scikit-learn. This dataset records breast cancer diagnoses based on assessments of patients. We shall attempt to learn these diagnoses with an SVM classifier.

📑 Learning Objectives
  • Import and explore the breast cancer dataset from scikit-learn.
  • Build an SVM classifier to predict whether tumors are malignant or benign.
  • Unpack the predictor to find the support vectors from the dataset.

Diagnosing Tumours in a Breast Cancer Dataset


from sklearn import datasets
data = datasets.load_breast_cancer()

See https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+Diagnostic for a description of the dataset.
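The two class labels and the 30 feature names can be printed directly from the returned dataset object:

print("Labels: ", data.target_names)
print("Features: ", data.feature_names)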

Put simply, the problem is to build a model to classify inputs into one of two target categories:

Labels:  ['malignant' 'benign']

The classification is made on the basis of the following 30 input features:

Features:  ['mean radius' 'mean texture' 'mean perimeter'
 'mean area' 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

We are given data points for 569 different patients, which we'll split into train and test datasets with an 80:20 ratio. This lets us hold back some data on which to test the performance of the model.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=109,
)

Fitting a linear SVM is straightforward:

from sklearn import svm
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)

Let's inspect the performance of our SVM on the held-out test data.

y_pred = clf.predict(X_test)

Then we ask ourselves, 'how well did we perform?'
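These scores can be computed with scikit-learn's metrics module; a minimal sketch:

from sklearn import metrics

# Compare predictions against the held-out ground truth
print("Model accuracy: ", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))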

Model accuracy:  0.956
Precision: 0.986
Recall: 0.946

Recall that we interpret the above results as follows:

  • Model accuracy is the proportion of correctly classified samples: the total correct predictions as a ratio of all predictions.
  • Precision is the number of true positives as a ratio of the total number of positive predictions: 'how often was a positive prediction correct?'
  • Recall is the number of true positives as a ratio of all actual positives: 'how many of the positives did we find?'

There are 51 support vectors in total, balanced on each side of the boundary. This gives a far smaller set of patients that medical professionals can study more closely to learn about the diagnosis.
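We can unpack these from the fitted classifier; a short sketch using the attributes scikit-learn exposes on SVC:

# Number of support vectors in each class, and their total
print(clf.n_support_, clf.n_support_.sum())

# Indices of the support vectors within the training set
print(clf.support_)

# The support vectors themselves, one row per vector
print(clf.support_vectors_.shape)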

Tuning Performance and Hyperparameters


Let's use our grid search assessment from before, but this time include a measure of recall in our review.

import pandas as pd


def print_dataframe(filtered_cv_results):
    """Pretty print for filtered dataframe"""
    for mean_precision, std_precision, mean_recall, std_recall, params in zip(
        filtered_cv_results["mean_test_precision"],
        filtered_cv_results["std_test_precision"],
        filtered_cv_results["mean_test_recall"],
        filtered_cv_results["std_test_recall"],
        filtered_cv_results["params"],
    ):
        print(
            f"precision: {mean_precision:0.3f} (±{std_precision:0.03f}),"
            f" recall: {mean_recall:0.3f} (±{std_recall:0.03f}),"
            f" for {params}"
        )
    print()


def refit_strategy(cv_results):
    """Define the strategy to select the best estimator.

    The strategy defined here is to filter out all results below a precision threshold
    of 0.95, rank the remaining by recall, and keep all models within one standard
    deviation of the best recall. From these models, we then select the fastest
    at prediction time.

    Parameters
    ----------
    cv_results : dict of numpy (masked) ndarrays
        CV results as returned by the `GridSearchCV`.

    Returns
    -------
    best_index : int
        The index of the best estimator as it appears in `cv_results`.
    """
    # print the info about the grid-search for the different scores
    precision_threshold = 0.95

    cv_results_ = pd.DataFrame(cv_results)
    print("All grid-search results:")
    print_dataframe(cv_results_)

    # Filter-out all results below the threshold
    high_precision_cv_results = cv_results_[
        cv_results_["mean_test_precision"] > precision_threshold
    ]

    print(f"Models with a precision higher than {precision_threshold}:")
    print_dataframe(high_precision_cv_results)

    high_precision_cv_results = high_precision_cv_results[
        [
            "mean_score_time",
            "mean_test_recall",
            "std_test_recall",
            "mean_test_precision",
            "std_test_precision",
            "rank_test_recall",
            "rank_test_precision",
            "params",
        ]
    ]

    # Select the most performant models in terms of recall
    # (within 1 sigma from the best)
    best_recall_std = high_precision_cv_results["mean_test_recall"].std()
    best_recall = high_precision_cv_results["mean_test_recall"].max()
    best_recall_threshold = best_recall - best_recall_std

    high_recall_cv_results = high_precision_cv_results[
        high_precision_cv_results["mean_test_recall"] > best_recall_threshold
    ]
    print(
        "Out of the previously selected high precision models, we keep all\n"
        "the models within one standard deviation of the highest recall model:"
    )
    print_dataframe(high_recall_cv_results)

    # From the best candidates, select the fastest model to predict
    fastest_top_recall_high_precision_index = high_recall_cv_results[
        "mean_score_time"
    ].idxmin()

    print(
        "\nThe selected final model is the fastest to predict out of the previously\n"
        "selected subset of best models based on precision and recall.\n"
        "Its scoring time is:\n\n"
        f"{high_recall_cv_results.loc[fastest_top_recall_high_precision_index]}"
    )

    return fastest_top_recall_high_precision_index

Now we can search over different values of the kernel as well as the regularisation parameter C.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

tuned_parameters = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

scores = ["precision", "recall"]

grid_search = GridSearchCV(
    SVC(), tuned_parameters, scoring=scores, refit=refit_strategy
)
grid_search.fit(X_train, y_train)

The search returns our optimal hyperparameters below.

All grid-search results:
precision: 0.942 (±0.006), recall: 0.922 (±0.030), for {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'}
precision: 0.929 (±0.014), recall: 0.961 (±0.026), for {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'}
precision: 0.939 (±0.015), recall: 0.919 (±0.028), for {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}
precision: 0.933 (±0.034), recall: 0.961 (±0.034), for {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'}
precision: 0.935 (±0.014), recall: 0.919 (±0.028), for {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
precision: 0.915 (±0.031), recall: 0.944 (±0.013), for {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
precision: 0.935 (±0.014), recall: 0.919 (±0.028), for {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'}
precision: 0.909 (±0.035), recall: 0.936 (±0.014), for {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}
precision: 0.960 (±0.035), recall: 0.979 (±0.021), for {'C': 1, 'kernel': 'linear'}
precision: 0.957 (±0.034), recall: 0.982 (±0.022), for {'C': 10, 'kernel': 'linear'}
precision: 0.953 (±0.030), recall: 0.975 (±0.021), for {'C': 100, 'kernel': 'linear'}
precision: 0.953 (±0.034), recall: 0.975 (±0.021), for {'C': 1000, 'kernel': 'linear'}

Models with a precision higher than 0.95:
precision: 0.960 (±0.035), recall: 0.979 (±0.021), for {'C': 1, 'kernel': 'linear'}
precision: 0.957 (±0.034), recall: 0.982 (±0.022), for {'C': 10, 'kernel': 'linear'}
precision: 0.953 (±0.030), recall: 0.975 (±0.021), for {'C': 100, 'kernel': 'linear'}
precision: 0.953 (±0.034), recall: 0.975 (±0.021), for {'C': 1000, 'kernel': 'linear'}

Out of the previously selected high precision models, we keep all
the models within one standard deviation of the highest recall model:
precision: 0.957 (±0.034), recall: 0.982 (±0.022), for {'C': 10, 'kernel': 'linear'}


The selected final model is the fastest to predict out of the previously
selected subset of best models based on precision and recall.
Its scoring time is:
mean_score_time 0.001023
mean_test_recall 0.982331
std_test_recall 0.022292
mean_test_precision 0.956626
std_test_precision 0.034467
rank_test_recall 1
rank_test_precision 2
params {'C': 10, 'kernel': 'linear'}
Name: 9, dtype: object

Success! We obtain the best-fitting model above our precision threshold, and it is also the fastest at inference time.
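As a final sanity check, the refit best estimator can be evaluated on the held-out test data; a sketch, assuming the grid search above has been run:

from sklearn import metrics

# GridSearchCV refits the selected model on the whole training set,
# so predictions can be made through the search object itself
y_pred = grid_search.predict(X_test)
print("Test precision:", metrics.precision_score(y_test, y_pred))
print("Test recall:", metrics.recall_score(y_test, y_pred))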