# Day 46(ML) — Implementation of Support Vector Machine for classification

Let’s implement a support vector classifier using the scikit-learn package in Python. The idea for regression follows a similar approach as classification: in the hard-margin setting we look for the best separating hyperplane by minimizing the L2 norm of the weights, subject to every point being classified correctly. In the soft-margin setting we allow some misclassifications, controlled by slack variables and the parameter ‘C’, which defines how much tolerance the model has towards errors.
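As a quick, self-contained illustration of the role of ‘C’ (synthetic toy data, not the bank dataset used below): a small C yields a tolerant soft margin with many support vectors, while a large C approximates the hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# synthetic, well-separated 2-D data (illustrative only, not the bank dataset)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])
y = np.array([0] * 20 + [1] * 20)

# a small C tolerates many margin violations (soft margin, many support vectors);
# a large C approximates the hard margin (few support vectors)
counts = {}
for C in (0.01, 100):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    counts[C] = int(clf.n_support_.sum())
    print('C =', C, '-> number of support vectors:', counts[C])
```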

**Implementation of Support Vector Classifier:** We will be using the bank campaign dataset for our analysis. The aim is to forecast whether the client will subscribe to a term deposit (Yes/No), which makes this a classification problem. The input predictors are a mix of numeric and categorical attributes.

```python
import copy
import pandas as pd

campaign_df = pd.read_csv('bank-full.csv')
# take a deep copy of the dataset
campaign_copy_df = copy.deepcopy(campaign_df)
# display the first five rows
campaign_copy_df.head()
```

Checking whether each column has respective information as per the description.

```python
# Numerical values check
print('Check whether any of the numeric columns has negative data:')
num_cols = campaign_copy_df.select_dtypes(include=['int64'])
num_count = (num_cols < 0).sum()
print(num_count)

print("\nDistinct negative values corresponding to 'pdays'")
pdays_negative_values = campaign_copy_df['pdays'][campaign_copy_df['pdays'] < 0].value_counts()
print(pdays_negative_values)

print("\nINFERENCE : (i) 3766 customers have negative balance.\n"
      "            (ii) 36954 customers have -1 pdays implying\n"
      "            the client was not contacted before.\n")
```

```
Check whether any of the numeric columns has negative data:
age             0
balance      3766
day             0
duration        0
campaign        0
pdays       36954
previous        0
dtype: int64

Distinct negative values corresponding to 'pdays'
-1    36954
Name: pdays, dtype: int64
```

We can observe that 3766 customers have a negative balance and 36954 customers have a pdays of -1, implying the client was not contacted before.
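Since -1 in ‘pdays’ is a sentinel meaning “never contacted” rather than a real number of days, one common treatment (a sketch, not part of the original notebook) is to split it into an indicator plus a cleaned numeric column:

```python
import pandas as pd

# small illustrative frame; -1 is the 'never contacted' sentinel in pdays
df = pd.DataFrame({'pdays': [-1, 10, -1, 180, 5]})

# split the sentinel into an explicit 'was contacted before' flag,
# and neutralise the -1 in the numeric column
df['was_contacted'] = (df['pdays'] != -1).astype(int)
df['pdays_clean'] = df['pdays'].where(df['pdays'] != -1, 0)
print(df)
```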

The next step is understanding the five-point summary of the numerical attributes.

```python
# five point summary calculation
# ('inference' is a dict of per-column remarks defined earlier in the notebook)
describe = campaign_copy_df.describe()
for i in describe:
    print('\nMean of {}:'.format(i), describe[i]['mean'])
    print('Median of {}:'.format(i), campaign_copy_df[i].median())
    print('Std of {}:'.format(i), describe[i]['std'])
    print('INFERENCE:', inference[i])
campaign_copy_df.describe()
```

```
Mean of age: 40.93621021432837
Median of age: 39.0
Std of age: 10.618762040975431
INFERENCE: There is a slight diff between mean and median.
This indicates presence of outliers and skewness.

Mean of balance: 1362.2720576850766
Median of balance: 448.0
Std of balance: 3044.7658291686002
INFERENCE: There is a huge diff between mean and median.
The standard deviation is very high.
This indicates presence of large number of outliers and right skewed as mean > median.

Mean of day: 15.80641879188693
Median of day: 16.0
Std of day: 8.322476153044185
INFERENCE: There is only a slight diff between mean and median.
This could be a sign of minimal or no outliers.

Mean of duration: 258.1630797814691
Median of duration: 180.0
Std of duration: 257.52781226517095
INFERENCE: There is a huge diff between mean and median.
The standard deviation is very high.
This indicates presence of large number of outliers and right skewed as mean > median.

Mean of campaign: 2.763840658246887
Median of campaign: 2.0
Std of campaign: 3.0980208832802205
INFERENCE: There is a diff between mean and median.
This indicates presence of outliers and right skewness.

Mean of pdays: 40.19782796222158
Median of pdays: -1.0
Std of pdays: 100.1287459906047
INFERENCE: There is a huge diff between mean and median.
The standard deviation is very high.
This indicates presence of large number of outliers and right skewed as mean > median.

Mean of previous: 0.5803233726305546
Median of previous: 0.0
Std of previous: 2.3034410449314233
INFERENCE: There is a significant diff between mean and median.
The standard deviation is very high.
This indicates presence of large number of outliers and right skewed as mean > median.
```

Another way of validating the outliers is by using the interquartile range. *Note: since our main focus is on the SVM, simple univariate techniques are used here, but we can always try out different outlier-detection mechanisms using the PyOD package discussed before.*

```python
# outliers calculation
Q1 = campaign_copy_df.quantile(0.25)
Q3 = campaign_copy_df.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - (1.5 * IQR)
upper_bound = Q3 + (1.5 * IQR)

for i in describe:
    print('\n', i)
    left_outlier_count = (campaign_copy_df[i] < lower_bound[i]).sum()
    right_outlier_count = (campaign_copy_df[i] > upper_bound[i]).sum()
    print('Number of left outliers present:', left_outlier_count)
    print('Number of right outliers present:', right_outlier_count)
    print('measure of skewness:', campaign_copy_df[i].skew())
    print('Inference: ', inference[i])
```

```
 age
Number of left outliers present: 0
Number of right outliers present: 487
measure of skewness: 0.6848179257252598
Inference: The data is slightly right skewed

 balance
Number of left outliers present: 17
Number of right outliers present: 4712
measure of skewness: 8.360308326166326
Inference: The data is right skewed

 day
Number of left outliers present: 0
Number of right outliers present: 0
measure of skewness: 0.09307901402122411
Inference: The data is uniformly distributed

 duration
Number of left outliers present: 0
Number of right outliers present: 3235
measure of skewness: 3.144318099423456
Inference: The data is right skewed

 campaign
Number of left outliers present: 0
Number of right outliers present: 3064
measure of skewness: 4.898650166179674
Inference: The data is right skewed

 pdays
Number of left outliers present: 0
Number of right outliers present: 8257
measure of skewness: 2.6157154736563477
Inference: The data is right skewed

 previous
Number of left outliers present: 0
Number of right outliers present: 8257
measure of skewness: 41.84645447266292
Inference: The data is right skewed
```
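A simple follow-up treatment, sketched here on a toy series rather than the bank data, is to cap (winsorise) values at the IQR fences instead of dropping rows:

```python
import pandas as pd

# illustrative series with one extreme right outlier
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# winsorise: clip extreme values to the IQR fences instead of deleting rows
capped = s.clip(lower=lower, upper=upper)
print(capped.max())  # the 100 is pulled down to the upper fence (6.625 here)
```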

This time we will try a new plotting style called the **“violin plot”** to understand the distribution of continuous input predictors across the discrete target outcome.

```python
fig, ax = plt.subplots(nrows=3, ncols=3, figsize=(16, 15))
sns.violinplot(campaign_copy_df['age'], campaign_copy_df['Target'], ax=ax[0][0])
sns.violinplot(campaign_copy_df['balance'], campaign_copy_df['Target'], ax=ax[0][1])
sns.violinplot(campaign_copy_df['day'], campaign_copy_df['Target'], ax=ax[0][2])
sns.violinplot(campaign_copy_df['duration'], campaign_copy_df['Target'], ax=ax[1][0])
sns.violinplot(campaign_copy_df['campaign'], campaign_copy_df['Target'], ax=ax[1][1])
sns.violinplot(campaign_copy_df['pdays'], campaign_copy_df['Target'], ax=ax[1][2])
sns.violinplot(campaign_copy_df['previous'], campaign_copy_df['Target'], ax=ax[2][0])

# remove the two unused axes in the 3x3 grid
fig.delaxes(ax[2, 1])
fig.delaxes(ax[2, 2])
```

Strong predictors include ‘duration’ and ‘pdays’. However, ‘duration’ will not be used in the predictive modelling because it is not available in the test data: the call duration is only known after the call is made, and the outcome of that call is the very thing we are trying to predict.
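The later modelling snippets assume encoded matrices such as `train_x1` and `test_x1`; a minimal sketch of that kind of preprocessing, on a tiny synthetic frame with illustrative column names (not the article’s actual code), might look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# tiny synthetic stand-in for the bank data (values are illustrative only)
campaign_df = pd.DataFrame({
    'age':      [30, 45, 52, 29, 61, 38, 44, 50, 33, 47],
    'duration': [120, 300, 90, 45, 600, 80, 250, 60, 110, 400],
    'job':      ['admin.', 'technician', 'admin.', 'services', 'retired',
                 'admin.', 'technician', 'services', 'admin.', 'retired'],
    'Target':   ['no', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no', 'no', 'yes'],
})

# drop 'duration' (unavailable at prediction time), one-hot encode, binarise target
X = pd.get_dummies(campaign_df.drop(columns=['duration', 'Target']))
y = (campaign_df['Target'] == 'yes').astype(int)

# stratified split keeps the yes/no ratio similar in both halves
train_x1, test_x1, train_y, test_y = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
print(train_x1.shape, test_x1.shape)
```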

Let’s apply the Spearman correlation to find out how the continuous variables are related to each other.

```python
fig, ax = plt.subplots(figsize=(9, 7))
ax = sns.heatmap(campaign_copy_df.corr(method='spearman'), ax=ax,
                 annot=True, linewidths=1, fmt='.2f', cmap='magma')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
```

The Spearman correlation indicates a strong relation between pdays and previous, so the attribute ‘previous’ could be dropped. In addition, we also check the relations involving the discrete features using the ANOVA and chi-squared tests.
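The chi-squared independence test mentioned above can be sketched as follows (illustrative columns, not the actual bank attributes):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# illustrative categorical feature vs. target (not the real bank columns)
df = pd.DataFrame({
    'housing': ['yes', 'yes', 'no', 'no', 'yes', 'no', 'yes', 'no'],
    'Target':  ['no',  'no',  'yes', 'yes', 'no',  'yes', 'no',  'no'],
})

# build the contingency table and run the chi-squared independence test
table = pd.crosstab(df['housing'], df['Target'])
chi2, p, dof, expected = chi2_contingency(table)
print('chi2 = %.3f, dof = %d, p-value = %.3f' % (chi2, dof, p))
```

A low p-value would suggest the feature and target are not independent, i.e. the feature carries signal.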

Finally, we apply the support vector classifier to categorize the data:

```python
from sklearn.svm import SVC

svc = SVC(kernel='linear')
svc.fit(train_x1, train_y)
complete_result.loc['svm-linear', 'train_score'] = svc.score(train_x1, train_y)
complete_result.loc['svm-linear', 'test_score'] = svc.score(test_x1, test_y)
predict_y = svc.predict(test_x1)
complete_result.loc['svm-linear', 'precision'] = precision_score(test_y, predict_y)
complete_result.loc['svm-linear', 'recall'] = recall_score(test_y, predict_y)
complete_result.loc['svm-linear', 'F1_score'] = f1_score(test_y, predict_y)

# print confusion matrix
cm = metrics.confusion_matrix(test_y, predict_y, labels=[1, 0])
df = pd.DataFrame(cm, index=["1", "0"], columns=["Predict 1", "Predict 0"])
plt.figure(figsize=(7, 5))
ax = sns.heatmap(df, annot=True)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
print("The confusion matrix of svm linear:\n", cm)
```

```
The confusion matrix of svm linear:
 [[  289  1269]
 [  162 11844]]
```

Next, we can try the radial basis function (RBF) kernel trick to check for an improvement in accuracy:

```python
# let's use the rbf kernel
svc_rbf = SVC(kernel='rbf', gamma='auto')
svc_rbf.fit(train_x1, train_y)
complete_result.loc['svm-rbf', 'train_score'] = svc_rbf.score(train_x1, train_y)
complete_result.loc['svm-rbf', 'test_score'] = svc_rbf.score(test_x1, test_y)
predict_y = svc_rbf.predict(test_x1)
complete_result.loc['svm-rbf', 'precision'] = precision_score(test_y, predict_y)
complete_result.loc['svm-rbf', 'recall'] = recall_score(test_y, predict_y)
complete_result.loc['svm-rbf', 'F1_score'] = f1_score(test_y, predict_y)

# print confusion matrix
cm = metrics.confusion_matrix(test_y, predict_y, labels=[1, 0])
df = pd.DataFrame(cm, index=["1", "0"], columns=["Predict 1", "Predict 0"])
plt.figure(figsize=(7, 5))
ax = sns.heatmap(df, annot=True)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
print("The confusion matrix of svm rbf:\n", cm)
```

```
The confusion matrix of svm rbf:
 [[  289  1269]
 [  162 11844]]
```
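Both kernels produce the same confusion matrix here, catching only 289 of the 1558 positive cases. One common follow-up for such class imbalance, sketched on synthetic data rather than the article’s experiment, is to standardise the features and set `class_weight='balanced'`:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import recall_score

# synthetic imbalanced data: 90% negatives, 10% positives (illustrative only)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(180, 2), rng.randn(20, 2) + 1.5])
y = np.array([0] * 180 + [1] * 20)

plain = make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma='auto'))
balanced = make_pipeline(StandardScaler(),
                         SVC(kernel='rbf', gamma='auto', class_weight='balanced'))

# compare training-set recall on the minority class
recalls = {}
for name, clf in [('plain', plain), ('balanced', balanced)]:
    clf.fit(X, y)
    recalls[name] = recall_score(y, clf.predict(X))
    print(name, 'minority recall: %.2f' % recalls[name])
```

Reweighting trades some precision for recall, which is often the right trade-off when missed subscribers are costlier than extra calls.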

The entire code can be found in the GitHub repository.

**Recommended Reading:**