Support Vector Machine

As we know, any Machine Learning algorithm or method can be described using 5 main questions:

  • What type of Training Data is Used?
  • What Mathematical Function is used?
  • What loss function is used?
  • How is the Model trained?
  • What metrics are used for model evaluation?
As discussed earlier, the classification task in Machine Learning can be done using various methods. We discussed Logistic Regression in the previous post; in this post we will journey through one of the state-of-the-art methods for the classification task, the Support Vector Machine. SVMs have proved to give really useful results on unseen data, with high accuracy.

Support Vector Machine 

Let's consider an example: We are given a classification task where our aim is to classify animals into Aquatic animals and Land animals. As Humans, we can complete the task either by

#1. Understanding the features (like how they breathe, what they eat, their habitat, their behavior, and many more) of all obvious aquatic animals like fish, and the similar features of land animals like tigers, goats, etc. This is the most obvious way to do the classification.

#2. Another way would be understanding the extreme cases of each class: studying why an animal that belongs to one class still exhibits some features of the other class, and then finding the limits of these feature values that separate the two. For example, crabs, turtles, hippopotamuses, and polar bears show features of both land animals and aquatic animals (we are considering Aquatic and Land as our only classes). After studying them, we can find the limits of the feature values that would classify them as Aquatic or Land.

Most Machine Learning methods use the first approach to build the model and extract useful information from the dataset. But SVM uses the second approach to study the dataset and build a model. This is the uniqueness of the SVM method, which has proved to be a useful way to classify data with high accuracy.
[Figure: SVM model]




What type of dataset is used?

As discussed earlier, for classification tasks we work on a dataset having some feature variables and a target variable. The feature variables can be continuous, but the target variable is discrete. In our example, 'Type of animal' is our target variable, with Aquatic and Land being its two classes; how they breathe, food, habitat, and body characteristics are its feature variables.
As we have seen, we do not work with categorical variables directly in ML, so we need to encode these target classes into numerical values.
Let yi ∊ {-1, +1}, where i indexes the training samples in our dataset (n training samples in total).
[Figure: classification dataset]
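As a quick illustration, here is a minimal sketch (Python with NumPy, using a made-up animal column, not data from this post) of encoding the two target classes into yi ∊ {-1, +1}:

import numpy as np

# Hypothetical target column for our animal example
target = np.array(["Aquatic", "Land", "Land", "Aquatic", "Land"])

# Encode the two classes as +1 (Aquatic) and -1 (Land)
y = np.where(target == "Aquatic", 1, -1)
print(y)  # [ 1 -1 -1  1 -1]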


What is the mathematical function?

From the data plot, it is clear that there can be an infinite number of ways to create a partition boundary that separates the classes. The colored areas show some of the ways to build the classification boundary, but we have to find the 'Just Right' model that completes our classification task.
[Figure: candidate SVM models]



Note - this plot is only for visualization; the number of features can be larger, and it is not possible to visualize an n-dimensional plot in 2D or 3D.

In n dimensions, we represent each data point as a vector and a separating line as a plane. So each data point is a vector and our model 'line' is a hyperplane. The hyperplanes passing through the extreme (closest) points of each class are called the supporting hyperplanes, and these extreme points themselves are called the Support Vectors; the complete SVM model depends only on these support vectors. The distance between the supporting hyperplanes is called the Separating Margin. SVM tries to separate the classes by creating a Separating Hyperplane (the model line) at the place where the two classes are furthest apart. This means the separating hyperplane is at an equal distance from the supporting hyperplanes.

The equation of our separating hyperplane will be:
wT*x + b = 0
w = the vector of 'm' weight parameters, one for each feature variable
x = the vector of 'm' feature variables
b = the bias term (a scalar)
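To make this concrete, here is a minimal sketch (Python/NumPy, with arbitrary made-up values for w, b, and x) of evaluating wT*x + b and classifying a point by the sign of the result:

import numpy as np

w = np.array([0.4, -1.2, 0.7])   # hypothetical weights, one per feature (m = 3)
b = 0.3                          # hypothetical bias term
x = np.array([1.0, 0.5, 2.0])    # one data point with m = 3 features

score = w @ x + b                # wT*x + b
label = 1 if score >= 0 else -1  # which side of the hyperplane the point lies on
print(score, label)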

What loss function is used?

SVM can be boiled down to an optimization problem: we have to maximize the separating margin and find the best separating hyperplane, with the constraint that all points must lie on the correct side of the hyperplane.
But why maximize the separating margin?
The reason is that it will generate the separating hyperplane which is furthest from both supporting hyperplanes. This reduces the prediction error on unseen data to a large extent, which is exactly our aim: to generate the 'Just Right' model for the given dataset.
For this, we first need to find the separating margin.
We know the separating hyperplane equation is: wT*x + b = 0
Let the data point x1 be the point of its class closest to the separating hyperplane.
[Figure: supporting hyperplanes]


xp : the perpendicular projection of x1 on the hyperplane
r : the vector from xp to x1, perpendicular to the hyperplane (its length ||r|| is the distance of x1 from the hyperplane)

xp = x1 - r

As r is perpendicular to the hyperplane, it is parallel to the normal vector w, so we can write: r = αw


As xp lies on the hyperplane: wT*xp + b = 0
0 = wT*xp + b
0 = wT*(x1 - r) + b
0 = wT*(x1 - αw) + b
α*wT*w = wT*x1 + b
Value of α:
α = (wT*x1 + b) / (wT*w)

Length of r: ||r||
||r|| = √(rT*r)
      = √(α²*wT*w)
      = |α|*√(wT*w)
||r|| = |wT*x1 + b| / ||w||
(where ||w|| = √(wT*w) is the L2 norm of w)
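As a quick sanity check, here is a small sketch (Python/NumPy, with arbitrary example values, not taken from this post) verifying that |wT*x1 + b| / ||w|| matches the distance obtained directly from the projection:

import numpy as np

w, b = np.array([3.0, 4.0]), -2.0      # hypothetical hyperplane wT*x + b = 0
x1 = np.array([2.0, 1.0])              # hypothetical closest point

alpha = (w @ x1 + b) / (w @ w)         # alpha from the derivation above
xp = x1 - alpha * w                    # projection of x1 onto the hyperplane
r = x1 - xp                            # perpendicular vector from the plane to x1

print(np.linalg.norm(r))                        # direct length of r
print(abs(w @ x1 + b) / np.linalg.norm(w))      # closed-form distance; same value (1.6)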

As we have considered the hyperplane to be the best separating hyperplane, it must be at an equal distance from both supporting hyperplanes.

The separating margin: γ(w, b)
γ(w, b) = 2*||r||
γ(w, b) = 2*|wT*x1 + b| / ||w||

We can also see that γ(w, b) is scale-invariant, i.e. γ(βw, βb) = γ(w, b)
(we can also cross-check this using the separating hyperplane equation).
Hence we can choose |wT*x1 + b| = 1, as we can fix the scale of w and b any way we want, and |wT*x1 + b| = 1 will simplify the equations to a great extent.
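A tiny sketch (Python/NumPy, arbitrary values) of this scale invariance: scaling w and b by the same factor β leaves the margin unchanged:

import numpy as np

w, b = np.array([2.0, -1.0]), 0.5
x1 = np.array([1.0, 3.0])              # hypothetical closest point

def margin(w, b, x1):
    return 2 * abs(w @ x1 + b) / np.linalg.norm(w)

beta = 7.3
print(np.isclose(margin(w, b, x1), margin(beta * w, beta * b, x1)))  # True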
 
Supporting hyperplanes: using |wT*x1 + b| = 1, we can write the equations of the supporting hyperplanes:
positive Hyperplane:  wT*x + b = 1
negative Hyperplane: wT*x + b = -1

We also have to take into account that each data point is correctly classified, i.e.
wT*xi + b >= 1    for   yi = 1
wT*xi + b <= -1   for   yi = -1
Combined condition:

yi*(wT*xi + b) >= 1

for all data points in the dataset (i ∊ [1, n])


Complete optimization problem:
max γ(w, b)   such that   yi*(wT*xi + b) >= 1   ∀ i ∊ [1, n]
max 2*|wT*x1 + b| / ||w||   such that   yi*(wT*xi + b) >= 1   ∀ i ∊ [1, n]
max 2 / ||w||   such that   yi*(wT*xi + b) >= 1   ∀ i ∊ [1, n]   (using |wT*x1 + b| = 1)
Maximizing 2/||w|| is equivalent to minimizing ||w||/2, and squaring does not change the minimizer, so the final objective is to find a solution to the following optimization problem:
min (1/2)*||w||²   such that   yi*(wT*xi + b) >= 1   ∀ i ∊ [1, n]
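For illustration only, here is a minimal sketch of solving this constrained problem numerically on a tiny made-up, linearly separable dataset using scipy.optimize.minimize with SLSQP (in practice, dedicated QP/SVM solvers are used instead):

import numpy as np
from scipy.optimize import minimize

# Tiny hypothetical, linearly separable dataset
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])

def objective(params):                  # (1/2)*||w||^2
    w = params[:-1]
    return 0.5 * w @ w

def margin_constraint(params):          # yi*(wT*xi + b) - 1 >= 0 for all i
    w, b = params[:-1], params[-1]
    return y * (X @ w + b) - 1.0

res = minimize(objective, x0=np.zeros(X.shape[1] + 1), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraint}])
w_opt, b_opt = res.x[:-1], res.x[-1]
print(w_opt, b_opt, y * (X @ w_opt + b_opt))    # all constraint values should be >= 1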

But in certain cases, mostly when the data is not linearly separable (which happens frequently when the data is low dimensional), this strict restriction does not produce any separating hyperplane. In this case, there is no solution to the above optimization problem, so we have to relax these strict restrictions to find one. This relaxation is done by introducing the 'slack' variable ξi. This concept was introduced by Vladimir Vapnik.

What we do is relax the constraint to:
yi*(wT*xi + b) >= 1 - ξi
This relaxes the rigid condition and allows some data points to violate the margin and come close to the separating hyperplane, but we also impose a penalty for such violations:
min (1/2)*||w||² + C*∑ξi       (summation over i, i ∊ [1, n])
where,
ξi = 0                          if yi*(wT*xi + b) >= 1
   = 1 - yi*(wT*xi + b)         if yi*(wT*xi + b) < 1

Overall: ξi = max(0, 1 - yi*(wT*xi + b))
Now we can write the complete loss function as:
J(w) = (1/2)*||w||² + C*∑max(0, 1 - yi*(wT*xi + b))
(summation over i, i ∊ [1, n]; training minimizes this J(w))

The first term of J(w) is the L2 regularization term and the second term is called the hinge loss.
The C in the hinge-loss term controls the extent of relaxation. It is a hyperparameter; its value is to be decided by the user. If the value of C is chosen to be very large, violations are penalized heavily and there is a chance of overfitting (low bias and high variance). If the value of C is chosen to be very small, many violations are tolerated and there is a chance of underfitting (high bias and low variance).
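A minimal sketch (Python/NumPy, with a hypothetical dataset X, y and hypothetical parameters w, b, C) of computing this loss function:

import numpy as np

def svm_loss(w, b, X, y, C):
    # J(w) = (1/2)*||w||^2 + C * sum_i max(0, 1 - yi*(wT*xi + b))
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * w @ w + C * hinge.sum()

# Hypothetical values, just to show a call
X = np.array([[1.0, 2.0], [-1.0, -1.5], [0.5, -2.0]])
y = np.array([1, -1, -1])
print(svm_loss(np.array([0.3, -0.2]), 0.1, X, y, C=1.0))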

How is the model Trained?

J(w) = (1/2)*||w||² + C*∑max(0, 1 - yi*(wT*xi + b)) ; summation over i, i ∊ [1, n]

This function is convex in w (the max function is also a convex function). We can use the Gradient Descent method for training our SVM. We follow these steps:
  1. We first initialize the weights to some small values.
  2. We then compute the gradient of the loss function.
  3. Then we use the equation wnew = wold - (learning rate)*(gradient of loss) to update the weights.
But there is a problem: although the hinge-loss function is convex, it is not differentiable everywhere (it has a kink where 1 - yi*(wT*xi + b) = 0). So how do we find the gradient?
[Figure: the max function is convex]



Here we use the concept of a sub-gradient, which says that a vector g is a sub-gradient of a function f at a point x if it satisfies f(y) >= f(x) + gT*(y - x) for all y.

So we first evaluate the max term and then compute the sub-gradient accordingly. For a single training sample, we get:
∇J(w) = w                        if max(0, 1 - yi*(wT*xi + b)) = 0
      = w - C*yi*xi              otherwise

Now we update the weights (α being the learning rate) as:

wnew = (1 - α)*wold                      if max(0, 1 - yi*(wT*xi + b)) = 0
wnew = (1 - α)*wold + α*C*yi*xi          otherwise

For the updates we generally use Stochastic Gradient Descent, as Batch and Mini-Batch Gradient Descent are computationally expensive in the case of SVM training.
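Putting the update rule together, here is a minimal sketch (Python/NumPy, with a hypothetical toy dataset and hyperparameters, not a production implementation) of training a linear SVM with stochastic sub-gradient descent; the bias update follows the same sub-gradient idea even though the derivation above only shows the update for w:

import numpy as np

def train_svm_sgd(X, y, C=1.0, lr=0.01, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):                 # one sample at a time (SGD)
            if y[i] * (X[i] @ w + b) < 1:            # hinge term active: margin violated
                w = (1 - lr) * w + lr * C * y[i] * X[i]
                b = b + lr * C * y[i]                # sub-gradient of the hinge term w.r.t. b
            else:                                    # only the regularization term contributes
                w = (1 - lr) * w
    return w, b

# Hypothetical toy data: two separable blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
w, b = train_svm_sgd(X, y)
print(np.mean(np.sign(X @ w + b) == y))              # training accuracy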

What Metrics are used for model evaluation?

Metrics used for model evaluation are Accuracy, Precision, Recall, and F-1 Score.
(These topics were discussed in the Logistic Regression post).
First, we need to compute the Confusion Matrix. Then we can use any one of the Metrics mentioned above for model evaluation.
[Figure: confusion matrix]



Accuracy - Ratio of all Correct predictions to all the data points in the data.

Precision - Ratio of all Correct True Positives to all the Positive Predictions.

Recall - Ratio of all Correct True Positives to all the Actual Positive Values.

F-1 Score - (2*Precision*Recall) / (Precision + Recall)
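A minimal sketch (Python/NumPy, with hypothetical true and predicted labels in {-1, +1}) of computing the confusion-matrix counts and these four metrics:

import numpy as np

y_true = np.array([1, 1, -1, -1, 1, -1, 1, -1])    # hypothetical actual labels
y_pred = np.array([1, -1, -1, 1, 1, -1, 1, -1])    # hypothetical model predictions

tp = np.sum((y_true == 1) & (y_pred == 1))     # true positives
tn = np.sum((y_true == -1) & (y_pred == -1))   # true negatives
fp = np.sum((y_true == -1) & (y_pred == 1))    # false positives
fn = np.sum((y_true == 1) & (y_pred == -1))    # false negatives

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)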




