by Dr Andy Corbett

Lesson

Support Vector Machines

11. Experimenting with Overlapping Class Distributions

📂 Resources

Download the resources for this lesson here.

In this video we explore overlapping classes by allowing the margin of our SVM to become porous. We extend our first code example to demonstrate this, and review crucial components of the model along the way. In particular, we run a grid search to optimise the hyper-parameter selections.

📑 Learning Objectives
  • Demonstrate a dataset with overlapping class distributions.
  • Implement a support vector machine with slack variables.
  • Tune the hyperparameters and understand the effect of these decisions.

Hyper-parameter selection


Let's review some of the important hyper-parameters that govern our model selection. Our flexibility in permitting a soft margin comes through the parameter $C$ in the objective

$$\frac{1}{2} \mathbf{w}^{T}\mathbf{w} + C \sum_{i=1}^{N} \xi_{i}.$$
  • The slack variables $\xi_{i} \geq 0$ allow points to drift over the margin:
$$y_{i}(\mathbf{w}^{T}\phi(\mathbf{x}_{i}) + b) \geq 1 - \xi_{i}.$$
  • If we increase the value of $C > 0$ we amplify the effect of the slack variables in the loss, penalising points that drift over the margin boundary. Conversely, as $C \rightarrow 0$, we reduce the slack-variable penalisation and amplify the regularisation of the parameters $\mathbf{w}$, which can eventually cause underfitting. The sketch below illustrates this trade-off.
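To see this trade-off concretely, here is a minimal sketch on hypothetical blob data (purely illustrative; the lesson's dataset is constructed below): as $C$ shrinks, the margin softens and more points become support vectors.

import numpy as np
from sklearn import svm

# Hypothetical, overlapping blobs purely for illustration
rng = np.random.default_rng(0)
X_demo = np.concatenate([rng.normal(2.5, 0.4, size=(50, 2)),
                         rng.normal(1.5, 0.4, size=(50, 2))])
y_demo = np.asarray(50*[1] + 50*[-1])

# A small C softens the margin (more support vectors); a large C
# penalises slack heavily and tightens the fit
for C in [0.01, 1, 100]:
    model = svm.SVC(kernel='linear', C=C).fit(X_demo, y_demo)
    print(f'C={C:<6} support vectors per class: {model.n_support_}')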

Overlapping class boundaries


Easing our foot off the brake, here we explore the code that allows vectors inside the margin, as well as misclassified points, to occur. First of all we initialise some parameters.

import numpy as np

NUM_DATA = 200
SPREAD = 0.13
REDUCTION = 0.2
SEED = 31
np.random.seed(SEED)

def get_blobs(num_samples, std_dev):
    """Generate two 2D normal distributions in the NE and SW quadrants."""
    cov = np.asarray([[std_dev, 0], [0, std_dev]])  # isotropic covariance
    mean_ne = np.asarray([2.5, 2.5])
    mean_sw = np.asarray([1.5, 1.5])
    ne = np.random.multivariate_normal(mean=mean_ne, cov=cov, size=num_samples)
    sw = np.random.multivariate_normal(mean=mean_sw, cov=cov, size=num_samples)
    return ne, sw

# A first, well-separated dataset: the reduced spread keeps the blobs apart
ne, sw = get_blobs(NUM_DATA, SPREAD * REDUCTION)
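As an optional sanity check (not part of the lesson script), we can confirm that the sample means sit near the chosen centres and that REDUCTION tightens the clouds:

# Optional sanity check: the empirical means should sit near (2.5, 2.5)
# and (1.5, 1.5); a smaller REDUCTION pulls the clouds in tighter
print('NE mean:', ne.mean(axis=0), 'std:', ne.std(axis=0))
print('SW mean:', sw.mean(axis=0), 'std:', sw.std(axis=0))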

Now we adjust our initial model to allow for a soft margin. This is achieved by reducing the hyper-parameter $C$.

import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.utils import shuffle

# Set up axes
LINE_MIN = 0.5
LINE_MAX = 3.5
AX_MIN = 0.25
AX_MAX = 3.75
SPACE = 0.05
fig, ax = plt.subplots(1, 1, figsize=[6, 6])
plt.subplots_adjust(wspace=SPACE, hspace=SPACE)

ax.tick_params(direction='in')
ax.set_xlim(AX_MIN, AX_MAX)
ax.set_ylim(AX_MIN, AX_MAX)
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)

# Regenerate the blobs at full spread so that the classes overlap
ne, sw = get_blobs(NUM_DATA, SPREAD)

# Organise the data: label the NE blob +1 and the SW blob -1
X = np.concatenate((ne, sw))
y = np.asarray(len(ne)*[1,] + len(sw)*[-1,])

# Randomly order the data, for good measure
X, y = shuffle(X, y, random_state=SEED)

# Fit the Support Vector Classifier
clf = svm.SVC(kernel='linear', C=10)
clf.fit(X, y)

a = [1, 1]
b = [3, 3]
pred_a, pred_b = clf.predict([a, b])
print('\nExample predictions:')
print(f'\t {a} is classified by {pred_a}')
print(f'\t {b} is classified by {pred_b}')
print(
    f'Number of support vectors: {clf.n_support_} in classes {clf.classes_}'
)

# Horizontal coordinates over which to draw the decision boundary
x0 = np.linspace(LINE_MIN, LINE_MAX, 100)

# Hyperplane equation: c[0]*x + c[1]*y + intercept_[0] = 0
c = clf.coef_[0]
slope = -c[0] / c[1]
y0 = slope * x0 - (clf.intercept_[0] / c[1])

# Margin boundaries: the perpendicular margin width is 1/||w||, which
# corresponds to a vertical offset of sqrt(1 + slope^2) * margin
margin = 1 / np.sqrt(np.sum(clf.coef_**2))
y_neg = y0 - np.sqrt(1 + slope**2) * margin
y_pos = y0 + np.sqrt(1 + slope**2) * margin

# Plot data and ground truth
gold = ax.scatter(sw[:, 0], sw[:, 1], s=10, color='goldenrod')
blue = ax.scatter(ne[:, 0], ne[:, 1], s=10, color='navy')
gt, = ax.plot(x0, -x0 + 4, color='lightcoral', linestyle='--', linewidth=2)
svs = ax.scatter(
            clf.support_vectors_[:, 0],
            clf.support_vectors_[:, 1],
            s=80,
            facecolors='none',
            zorder=10,
            edgecolors='fuchsia',
)
pred, = ax.plot(x0, y0, "g-")
mar, = ax.plot(x0, y_neg, "g--")
ax.plot(x0, y_pos, "g--")

ax.legend(
    [svs, mar, pred, gt],
    ['Support vectors', r'$y=\pm1$', '$y=0$', 'GT'],
    loc='upper left',
    framealpha=1.,
)

plt.show()
Figure: Overlapping classes, with support vectors circled and the margin lines drawn either side of the decision boundary.

To choose the appropriate hyper-parameters for our model, we can grid-search over a range of values and assess model performance with cross-validation. Before building the full search, the short sketch below shows the cross-validation step in isolation.
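A minimal sketch, scoring a single candidate value of $C$ (the cv=5 fold count is an assumption, matching GridSearchCV's default):

from sklearn.model_selection import cross_val_score

# Score one candidate value of C with 5-fold cross-validation
cv_scores = cross_val_score(svm.SVC(kernel='linear', C=1), X, y, cv=5)
print(f'Mean CV accuracy for C=1: {cv_scores.mean():0.3f} (±{cv_scores.std():0.3f})')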

import pandas as pd

scores = ["precision", "recall"]


def print_dataframe(filtered_cv_results):
    """Pretty print for filtered dataframe"""
    for mean_precision, std_precision, mean_recall, std_recall, params in zip(
        filtered_cv_results["mean_test_precision"],
        filtered_cv_results["std_test_precision"],
        filtered_cv_results["mean_test_recall"],
        filtered_cv_results["std_test_recall"],
        filtered_cv_results["params"],
    ):
        print(
            f"precision: {mean_precision:0.3f} (±{std_precision:0.03f}),"
            f" recall: {mean_recall:0.3f} (±{std_recall:0.03f}),"
            f" for {params}"
        )
    print()


def refit_strategy(cv_results):
    """Define the strategy to select the best estimator.

    The strategy defined here is to filter out all results below a precision
    threshold of 0.975 and then, among the remaining candidates, select the
    model that is fastest to predict.

    Parameters
    ----------
    cv_results : dict of numpy (masked) ndarrays
        CV results as returned by `GridSearchCV`.

    Returns
    -------
    best_index : int
        The index of the best estimator as it appears in `cv_results`.
    """
    precision_threshold = 0.975

    # Print the grid-search results for both scores
    cv_results_ = pd.DataFrame(cv_results)
    print("All grid-search results:")
    print_dataframe(cv_results_)

    # Filter-out all results below the threshold
    high_precision_cv_results = cv_results_[
        cv_results_["mean_test_precision"] > precision_threshold
    ]

    print(f"Models with a precision higher than {precision_threshold}:")
    print_dataframe(high_precision_cv_results)

    high_precision_cv_results = high_precision_cv_results[
        [
            "mean_score_time",
            "mean_test_recall",
            "std_test_recall",
            "mean_test_precision",
            "std_test_precision",
            "rank_test_recall",
            "rank_test_precision",
            "params",
        ]
    ]

    # From the best candidates, select the fastest model to predict
    fastest_high_precision_index = high_precision_cv_results[
        "mean_score_time"
    ].idxmin()

    print(
        "\nThe selected final model is the fastest to predict out of the previously\n"
        "selected subset of best models based on precision and recall.\n"
        "Its scoring time is:\n\n"
        f"{high_precision_cv_results.loc[fastest_high_precision_index]}"
    )

    return fastest_high_precision_index

Then we can run the search and print the conclusions.

from sklearn.model_selection import GridSearchCV

svc = svm.SVC()
param_grid = [
  {'C': [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10], 'kernel': ['linear']},
]
clf = GridSearchCV(
    svc, param_grid, scoring=scores, refit=refit_strategy
)
clf.fit(X, y)

For our model, we conclude from the output below that the best (and fastest) choice is $C = 0.005$.

All grid-search results:
precision: 0.980 (±0.019), recall: 0.975 (±0.016), for {'C': 0.001, 'kernel': 'linear'}
precision: 0.980 (±0.010), recall: 0.975 (±0.016), for {'C': 0.005, 'kernel': 'linear'}
precision: 0.965 (±0.012), recall: 0.975 (±0.016), for {'C': 0.01, 'kernel': 'linear'}
precision: 0.965 (±0.012), recall: 0.975 (±0.016), for {'C': 0.05, 'kernel': 'linear'}
precision: 0.965 (±0.012), recall: 0.975 (±0.016), for {'C': 0.1, 'kernel': 'linear'}
precision: 0.965 (±0.012), recall: 0.975 (±0.016), for {'C': 0.5, 'kernel': 'linear'}
precision: 0.970 (±0.010), recall: 0.975 (±0.016), for {'C': 1, 'kernel': 'linear'}
precision: 0.970 (±0.010), recall: 0.975 (±0.016), for {'C': 5, 'kernel': 'linear'}
precision: 0.970 (±0.010), recall: 0.975 (±0.016), for {'C': 10, 'kernel': 'linear'}

Models with a precision higher than 0.975:
precision: 0.980 (±0.019), recall: 0.975 (±0.016), for {'C': 0.001, 'kernel': 'linear'}
precision: 0.980 (±0.010), recall: 0.975 (±0.016), for {'C': 0.005, 'kernel': 'linear'}

The selected final model is the fastest to predict out of the previously selected subset of best models based on precision and recall. Its scoring time is:

mean_score_time                                0.000968
mean_test_recall                                  0.975
std_test_recall                                0.015811
mean_test_precision                            0.979994
std_test_precision                             0.010011
rank_test_recall                                      1
rank_test_precision                                   2
params                 {'C': 0.005, 'kernel': 'linear'}
Name: 1, dtype: object
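
Finally, a short sketch of how the refitted winner might be inspected and used, assuming the clf returned by the grid search above (with a callable refit, GridSearchCV refits the estimator at the returned index):

# Inspect the selected model and delegate predictions to it
print('Selected hyper-parameters:', clf.best_params_)
print('Predictions for [1, 1] and [3, 3]:', clf.predict([[1, 1], [3, 3]]))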