Day 46 (ML) — Implementation of Support Vector Machine for Classification

Let's implement a support vector classifier using the scikit-learn package in Python. The idea for regression follows a similar approach to classification: in the hard-margin setting we look for the separating hyperplane that maximizes the margin by minimizing the L2 norm of the weights. For the soft margin, we allow some misclassifications, controlled by slack variables ξ (one per sample) and the parameter C, which defines how much tolerance the model has towards these errors.
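For reference, the standard soft-margin primal objective (the textbook formulation, not spelled out in the original post) is:

\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0

A larger C penalizes slack more heavily (closer to a hard margin), while a smaller C tolerates more margin violations.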

Implementation of Support Vector Classifier: We will be using the bank marketing campaign dataset for our analysis. The aim is to predict whether a client will subscribe to a term deposit (Yes/No), which makes this a classification problem. The input predictors are a mix of numeric and categorical features.

import copy
import pandas as pd

# note: the UCI version of this file is semicolon-separated; adjust sep= if needed
campaign_df = pd.read_csv('bank-full.csv')
# take a deep copy of the dataset
campaign_copy_df = copy.deepcopy(campaign_df)
# display the first five rows
campaign_copy_df.head()

Let's check whether each column actually contains the kind of values its description promises.

# Numerical values check
print('Check whether any of the numeric columns has negative data:')
num_cols = campaign_copy_df.select_dtypes(include=['int64'])
num_count = (num_cols < 0).sum()
print(num_count)
print("\nDistinct negative values corresponding to 'pdays'")
pdays_negative_values = campaign_copy_df['pdays'][campaign_copy_df['pdays'] < 0].value_counts()
print(pdays_negative_values)
print("\nINFERENCE : (i) 3766 customers have negative balance.\n"
      "            (ii) 36954 customers have -1 pdays implying\n"
      "            the client was not contacted before.\n")
Check whether any of the numeric columns has negative data:
age 0
balance 3766
day 0
duration 0
campaign 0
pdays 36954
previous 0
dtype: int64

Distinct negative values corresponding to 'pdays'
-1 36954
Name: pdays, dtype: int64

We can observe that 3766 customers have a negative balance and 36954 customers have pdays = -1, implying those clients were not contacted in a previous campaign.

The next step is to look at the five-number summary of the numerical attributes.
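The loop below prints a hand-written note per column from an `inference` dictionary, which the post never shows being defined; a minimal sketch, reconstructed from the printed output, might look like:

inference = {
    'age': ('There is a slight diff between mean and median.\n'
            'This indicates presence of outliers and skewness.'),
    'balance': ('There is a huge diff between mean and median.\n'
                'The standard deviation is very high.\n'
                'This indicates presence of large number of outliers '
                'and right skewed as mean > median.'),
    # ...and similar hand-written notes for 'day', 'duration',
    # 'campaign', 'pdays' and 'previous', matching the output below.
}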

# five-number summary calculation
describe = campaign_copy_df.describe()
for i in describe:
    print('\nMean of {}:'.format(i), describe[i]['mean'])
    print('Median of {}:'.format(i), campaign_copy_df[i].median())
    print('Std of {}:'.format(i), describe[i]['std'])
    print('INFERENCE:', inference[i])

Mean of age: 40.93621021432837
Median of age: 39.0
Std of age: 10.618762040975431
INFERENCE: There is a slight diff between mean and median.
This indicates presence of outliers and skewness.


Mean of balance: 1362.2720576850766
Median of balance: 448.0
Std of balance: 3044.7658291686002
INFERENCE: There is a huge diff between mean and median.
The standard deviation is very high.
This indicates presence of large number of outliers and right skewed as mean > median.


Mean of day: 15.80641879188693
Median of day: 16.0
Std of day: 8.322476153044185
INFERENCE: There is only a slight diff between mean and median.
This could be a sign of minimal or no outliers.


Mean of duration: 258.1630797814691
Median of duration: 180.0
Std of duration: 257.52781226517095
INFERENCE: There is a huge diff between mean and median.
The standard deviation is very high.
This indicates presence of large number of outliers and right skewed as mean > median.


Mean of campaign: 2.763840658246887
Median of campaign: 2.0
Std of campaign: 3.0980208832802205
INFERENCE: There is a diff between mean and median.
This indicates presence of outliers and right skewness.


Mean of pdays: 40.19782796222158
Median of pdays: -1.0
Std of pdays: 100.1287459906047
INFERENCE: There is a huge diff between mean and median.
The standard deviation is very high.
This indicates presence of large number of outliers and right skewed as mean > median.


Mean of previous: 0.5803233726305546
Median of previous: 0.0
Std of previous: 2.3034410449314233
INFERENCE: There is a significant diff between mean and median.
The standard deviation is very high.
This indicates presence of large number of outliers and right skewed as mean > median.

Another way of validating the outliers is the interquartile range (IQR). Note: since our main focus is on the SVM, simple univariate techniques are used here, but we could always try other outlier-detection mechanisms using the PyOD package discussed before.
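As before, the loop prints per-column notes from an `inference` dictionary, presumably redefined here with skewness-based comments; a sketch reconstructed from the output:

inference = {
    'age': 'The data is slightly right skewed',
    'balance': 'The data is right skewed',
    'day': 'The data is uniformly distributed',
    # ...similarly for 'duration', 'campaign', 'pdays' and 'previous'.
}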

# outliers calculation (IQR fences over the numeric columns)
Q1 = num_cols.quantile(0.25)
Q3 = num_cols.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - (1.5 * IQR)
upper_bound = Q3 + (1.5 * IQR)
for i in describe:
    print('\n', i)
    left_outlier_count = (campaign_copy_df[i] < lower_bound[i]).sum()
    right_outlier_count = (campaign_copy_df[i] > upper_bound[i]).sum()
    print('Number of left outliers present:', left_outlier_count)
    print('Number of right outliers present:', right_outlier_count)
    print('measure of skewness:', campaign_copy_df[i].skew())
    print('Inference: ', inference[i])
age
Number of left outliers present: 0
Number of right outliers present: 487
measure of skewness: 0.6848179257252598
Inference: The data is slightly right skewed

balance
Number of left outliers present: 17
Number of right outliers present: 4712
measure of skewness: 8.360308326166326
Inference: The data is right skewed

day
Number of left outliers present: 0
Number of right outliers present: 0
measure of skewness: 0.09307901402122411
Inference: The data is uniformly distributed

duration
Number of left outliers present: 0
Number of right outliers present: 3235
measure of skewness: 3.144318099423456
Inference: The data is right skewed

campaign
Number of left outliers present: 0
Number of right outliers present: 3064
measure of skewness: 4.898650166179674
Inference: The data is right skewed

pdays
Number of left outliers present: 0
Number of right outliers present: 8257
measure of skewness: 2.6157154736563477
Inference: The data is right skewed

previous
Number of left outliers present: 0
Number of right outliers present: 8257
measure of skewness: 41.84645447266292
Inference: The data is right skewed
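The analysis above only counts the outliers. If we also wanted to treat them, one simple option (an illustrative choice, not something the original post does) is to cap each numeric column at the IQR fences computed above:

# Illustrative only: winsorize each numeric column at its IQR fences.
capped_df = campaign_copy_df.copy()
for col in describe:
    capped_df[col] = capped_df[col].clip(lower=lower_bound[col], upper=upper_bound[col])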

This time we will try a new plot type, the violin plot, to understand the distribution of each continuous input predictor across the discrete target outcome.

fig, ax = plt.subplots(nrows=3, ncols=3, figsize=(16, 15))
sns.violinplot(x=campaign_copy_df['age'], y=campaign_copy_df['Target'], ax=ax[0][0])
sns.violinplot(x=campaign_copy_df['balance'], y=campaign_copy_df['Target'], ax=ax[0][1])
sns.violinplot(x=campaign_copy_df['day'], y=campaign_copy_df['Target'], ax=ax[0][2])
sns.violinplot(x=campaign_copy_df['duration'], y=campaign_copy_df['Target'], ax=ax[1][0])
sns.violinplot(x=campaign_copy_df['campaign'], y=campaign_copy_df['Target'], ax=ax[1][1])
sns.violinplot(x=campaign_copy_df['pdays'], y=campaign_copy_df['Target'], ax=ax[1][2])
sns.violinplot(x=campaign_copy_df['previous'], y=campaign_copy_df['Target'], ax=ax[2][0])
# remove the two unused panels
fig.delaxes(ax[2, 1])
fig.delaxes(ax[2, 2])
Fig 1 — violin plots of the continuous predictors against the target

Strong predictors include 'duration' and 'pdays'. However, 'duration' will not be used in the predictive modelling because it is not available at prediction time: the call duration is only known after the call has been made, which is exactly when the outcome itself becomes known. (It is dropped in the preprocessing sketch further below.)

Let's apply the Spearman correlation to find out how the continuous variables relate to each other.

fig, ax = plt.subplots(figsize=(9, 7))
# note: recent pandas versions may require numeric_only=True in corr()
ax = sns.heatmap(campaign_copy_df.corr(method='spearman'), ax=ax, annot=True, linewidths=1, fmt='.2f', cmap='magma')
# work around a matplotlib version bug that clips the top/bottom heatmap rows
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

The Spearman correlation indicates a strong relationship between 'pdays' and 'previous', so the attribute 'previous' could be dropped. In addition, we can check the relationships involving the discrete features using the ANOVA and chi-squared tests.
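The modelling code below assumes train/test splits `train_x1`, `test_x1`, `train_y`, `test_y` and a results table `complete_result` prepared earlier. A minimal sketch of how they might be produced, where the encoding, scaling, and split choices are assumptions rather than the post's exact pipeline:

# Hypothetical preprocessing sketch; names and choices are assumptions.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

model_df = campaign_copy_df.drop(columns=['duration', 'previous'])  # leakage / redundancy
model_df['Target'] = model_df['Target'].map({'yes': 1, 'no': 0})    # encode the label
X = pd.get_dummies(model_df.drop(columns=['Target']), drop_first=True)  # one-hot categoricals
y = model_df['Target']
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# SVMs are sensitive to feature scale, so standardize (fit on train only)
scaler = StandardScaler().fit(train_x)
train_x1 = scaler.transform(train_x)
test_x1 = scaler.transform(test_x)

# empty results table that the modelling cells fill in
complete_result = pd.DataFrame(
    columns=['train_score', 'test_score', 'precision', 'recall', 'F1_score'])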

Finally, we apply the support vector classifier to categorize the data:

from sklearn.svm import SVC
from sklearn import metrics
from sklearn.metrics import precision_score, recall_score, f1_score

svc = SVC(kernel='linear')
svc.fit(train_x1, train_y)
complete_result.loc['svm-linear', 'train_score'] = svc.score(train_x1, train_y)
complete_result.loc['svm-linear', 'test_score'] = svc.score(test_x1, test_y)
predict_y = svc.predict(test_x1)
complete_result.loc['svm-linear', 'precision'] = precision_score(test_y, predict_y)
complete_result.loc['svm-linear', 'recall'] = recall_score(test_y, predict_y)
complete_result.loc['svm-linear', 'F1_score'] = f1_score(test_y, predict_y)
# print confusion matrix (positive class listed first)
cm = metrics.confusion_matrix(test_y, predict_y, labels=[1, 0])
df = pd.DataFrame(cm, index=["1", "0"], columns=["Predict 1", "Predict 0"])
plt.figure(figsize=(7, 5))
ax = sns.heatmap(df, annot=True)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
print("The confusion matrix of svm linear:\n", cm)

The confusion matrix of svm linear:
[[ 289 1269]
[ 162 11844]]

Next, we can try the radial basis function (RBF) kernel to check for an improvement in accuracy:

# let's use the rbf kernel
svc_rbf = SVC(kernel='rbf', gamma='auto')
svc_rbf.fit(train_x1, train_y)
complete_result.loc['svm-rbf', 'train_score'] = svc_rbf.score(train_x1, train_y)
complete_result.loc['svm-rbf', 'test_score'] = svc_rbf.score(test_x1, test_y)
predict_y = svc_rbf.predict(test_x1)
complete_result.loc['svm-rbf', 'precision'] = precision_score(test_y, predict_y)
complete_result.loc['svm-rbf', 'recall'] = recall_score(test_y, predict_y)
complete_result.loc['svm-rbf', 'F1_score'] = f1_score(test_y, predict_y)
# print confusion matrix
cm = metrics.confusion_matrix(test_y, predict_y, labels=[1, 0])
df = pd.DataFrame(cm, index=["1", "0"], columns=["Predict 1", "Predict 0"])
plt.figure(figsize=(7, 5))
ax = sns.heatmap(df, annot=True)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
print("The confusion matrix of svm rbf:\n", cm)

The confusion matrix of svm rbf:
[[ 289 1269]
[ 162 11844]]
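The recall on the positive class is low in both confusion matrices, which is typical for this imbalanced dataset. A natural next step (not part of the original post) is to tune C and gamma and to try class weighting; a minimal GridSearchCV sketch, where the grid values are illustrative assumptions:

# Hypothetical tuning sketch; the grid values are assumptions.
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 0.01, 0.1],
    'class_weight': [None, 'balanced'],  # may help with the imbalance seen above
}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, scoring='f1', cv=3, n_jobs=-1)
grid.fit(train_x1, train_y)
print('Best params:', grid.best_params_)
print('Best CV F1 :', grid.best_score_)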

The entire code can be found in the GitHub repository.
