Logistic Regression
In the Linear Regression post we discussed the five main questions of the Machine Learning universe. Similarly, in this post we will explain one of the classification algorithms, Logistic Regression, using these same five questions:

- What Training Data is used?
- What Mathematical Function is used?
- Which Loss Function is used?
- How is the model trained?
- What Metric is used for evaluation?
Classification
From its name, it feels like this method has something to do with classifying some sort of data.
Yes!! It's true.
Classification is a type of supervised machine learning algorithm, and like any other supervised learning algorithm it works on a data-set with some feature variables and a target variable. The only difference between this method and the Linear Regression method is the type of its target variable.
In Linear Regression we study a CONTINUOUS target variable, whereas in Classification we study DISCRETE target variables. These discrete values of the target variable denote the different classes in the data-set. The aim is to find some relationship between the feature variables and the discrete classes of the target variable: that is, which values of the feature variables will classify a data-point into a certain class.
There are various classification methods. Some of the commonly used ones are Logistic Regression, Naive Bayes Classification, K-Nearest Neighbours, Support Vector Machines, Decision Trees and Random Forest Classification.
In this post we will discuss the Logistic Regression method.
Logistic Regression
Let's start with:
On what type of data-sets can we apply the Logistic Regression method?
As discussed in the Linear Regression post, the choice of a machine learning method depends upon the type of training data in our data-set. So a prior analysis of the data-set is a must before applying any machine learning method to it.
Consider an example:
We have a company data-set having SALARY and AGE as its feature variables and MODE OF TRANSPORT as its target variable. After a prior analysis of the data-set we found that there are two different classes in the target variable: Private Transport and Public Transport. Our task is to classify all the data-points in the data-set into these two classes and predict how an employee will reach the office (using Private Transport or Public Transport) based upon their Salary and Age.
For simplicity we will consider only one feature variable, so that we can visualise the results on 2-D plots. We will consider Salary as our feature, as it shows a higher correlation with the Mode of Transport (more money, more chances of using a private mode of transport 😉).
Before we plot the graph we need to do some data-processing: we need to convert the categorical data (the classes in the target variable) into a numerical encoding, since machine learning models do not work with categorical variables directly. So we will encode Private Transport as '1' and Public Transport as '0'. The feature plot looks like this:
We can see that for the continuous values of SALARY the data-points are distributed into only two classes. Our next aim is to find the just-right model which would fit this type of data-set, so that for every continuous value of SALARY we can predict the class it will belong to.
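The encoding step itself is simple. As a minimal sketch (the DataFrame, column names and values here are hypothetical, just to illustrate the idea):

```python
import pandas as pd

# Hypothetical slice of the company data-set.
df = pd.DataFrame({
    "salary": [20000, 25000, 40000, 60000, 80000, 95000],
    "mode_of_transport": ["Public", "Public", "Public",
                          "Private", "Private", "Private"],
})

# Encode the categorical target: Private Transport -> 1, Public Transport -> 0.
df["target"] = (df["mode_of_transport"] == "Private").astype(int)
print(df[["salary", "target"]])
```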
What Mathematical Function is used?
We will first try to analyse the data-set using a Linear Regression model, to understand what we actually require.
Linear Regression mathematical function: y = b + w*x
b = bias parameter
w = weight parameter
x = feature variable
y = target variable (continuous)
The model line on our data-set looks like this:
The first thing we can observe is the high value of error for every data-point.
But this was bound to happen: after all, Linear Regression predicts a continuous value, whereas we require a discrete output. Let's look at the model plot from a different angle and consider the output value as the probability of predicting one class. What I mean is, we look at the target value as the probability that an employee having some 'x' amount of salary reaches the office by Private Transport. As we have encoded the target value for Private Transport as '1' and Public Transport as '0', this is quite intuitive, since a probability lies in [0,1].
For the parts of the model line which extend above y=1 and below y=0, let the intersections with y=1 and y=0 be at salaries c1 and c2 respectively. We can say that an employee with a salary greater than c1 will definitely use Private Transport, and an employee with a salary less than c2 is certain to use Public Transport. This flattens the extensions of the model line above y=1 and below y=0 down to y=1 and y=0, as shown below.

This now restricts the output value to [0,1], which is what we desire. We will predict the output value for each input SALARY and consider that all the data-points with probability values between 0 and 0.5 belong to class '0' (they use Public Transport), while the data-points with probability values greater than 0.5 belong to class '1' (they use Private Transport). This assumption reduces the error to a large extent and produces some useful results.
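In code, this decision rule is one line. A minimal sketch with numpy (the probability values below are made up for illustration):

```python
import numpy as np

# Hypothetical predicted probabilities for a few salaries.
p = np.array([0.10, 0.45, 0.62, 0.93])

# Probability above 0.5 -> class '1' (Private), otherwise class '0' (Public).
predicted_class = (p > 0.5).astype(int)
print(predicted_class)  # [0 0 1 1]
```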
Now the main question: is there any mathematical function/equation which will fulfil all our requirements?
Yes!! There is one such Function....
The SIGMOID FUNCTION
The mathematical equation is: p = 1/(1 + e^(-y))
Here y is a linear combination of all the feature values: y = b + w1x1 + w2x2 + ...
Solving for y in terms of p gives the log-odds:
ln(p/(1-p)) = y
ln(p/(1-p)) = b + w1x1 + w2x2 + ...
This is the mathematical equation for our model.
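Written as code, the model is just the linear combination passed through the sigmoid. A minimal numpy sketch (the function names here are my own, not from any library):

```python
import numpy as np

def sigmoid(y):
    # p = 1 / (1 + e^(-y)), squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-y))

def predict_proba(X, w, b):
    # y = b + w1*x1 + w2*x2 + ... for every row of X (shape (m, n)).
    return sigmoid(X @ w + b)
```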
Which Loss Function is used?
Now we also need to compute the loss.
But unlike Linear Regression we cannot use the Mean Squared Error loss function directly. We are not producing continuous values, and using MSE assumes the data comes from a Gaussian distribution, whereas we are dealing with data having two discrete classes, which comes from a Bernoulli distribution. (You can google these distributions to understand them better.) We need to penalise our model when it predicts '1' while the actual value is '0', and when it predicts '0' while the actual value is '1'.
Mathematically, we need to transform our loss function accordingly.
Model equation is:
hw(x) = 1/(1 + e^(-y)) and y = b + w1x1 + w2x2 + ...
The derived loss for our two cases will be:
Loss(hw(x), y) = -log(hw(x)) for y = 1
Loss(hw(x), y) = -log(1 - hw(x)) for y = 0
The combined loss is called the Cross Entropy loss: -y*log(hw(x)) - (1-y)*log(1 - hw(x))
The Logistic Loss function for m data-points can be written as:
J(w) = (1/m) ∑ (Cross Entropy loss), summation over i ∈ [1, m], where m is the number of data-points
J(w) = (1/m) ∑ (-y(i)*log(hw(x(i))) - (1-y(i))*log(1 - hw(x(i))))
We have to minimise this loss function while training the model.
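This loss is straightforward to compute. A minimal sketch, again with numpy (the epsilon clipping is my own addition to avoid taking log(0)):

```python
import numpy as np

def cross_entropy_loss(p, y, eps=1e-12):
    # J(w) = (1/m) * sum(-y*log(p) - (1-y)*log(1-p))
    # Clip probabilities away from exactly 0 and 1 so the logs stay finite.
    p = np.clip(p, eps, 1 - eps)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))
```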
How is the model Trained?
In the Linear Regression post we discussed three different types of Gradient Descent methods which can be used to train a model. Similarly, to train the Logistic Regression model we are going to use the Gradient Descent optimization method.
Calculating the gradient of the loss function J(w) with respect to the weight wj:
(∂/∂wj) J(w) = (1/m) ∑ (∂/∂wj)(-y(i)*log(hw(x(i))) - (1-y(i))*log(1 - hw(x(i))))
= (1/m) ∑ (-y(i)/hw(x(i)) + (1-y(i))/(1 - hw(x(i)))) * (∂/∂wj) hw(x(i)), summation over i ∈ [1, m]
We know that hw(x(i)) is a function of y, and y in turn is a function of wj. So using the chain rule we can write (∂/∂wj) hw(x) = (∂/∂y) hw(x) * (∂/∂wj) y.
(∂/∂y) hw(x) = (∂/∂y) (1/(1 + e^(-y)))
= e^(-y)/(1 + e^(-y))^2
= (1/(1 + e^(-y))) * (1 - 1/(1 + e^(-y)))
= hw(x) * (1 - hw(x))
(∂/∂wj) y = xj
Substituting back:
(∂/∂wj) J(w) = (1/m) ∑ (-y(i)/hw(x(i)) + (1-y(i))/(1 - hw(x(i)))) * hw(x(i)) * (1 - hw(x(i))) * xj(i)
= (1/m) ∑ (-y(i)*(1 - hw(x(i))) + (1 - y(i))*hw(x(i))) * xj(i)
(∂/∂wj) J(w) = (1/m) ∑ (hw(x(i)) - y(i)) * xj(i), summation over i ∈ [1, m] (m is the number of data-points)
Now we can update the weights as:
wj(new) = wj(old) - η * (∂/∂wj) J(w), for the jth weight parameter
(Note the minus sign: we move against the gradient because we want to minimise the loss.)
η is the learning rate, a hyperparameter whose value is decided by the user.
The three different methods of Gradient Descent are:
Batch Gradient Descent: We update the weights after calculating the gradient of the total loss over all the data-points in our data-set.
Mathematically: wj(new) = wj(old) - η * (∂/∂wj) J(w), with the gradient summed over i ∈ [1, m], for the jth weight parameter
Stochastic Gradient Descent: We update the weights after calculating the gradient of the loss for every single data-point in our data-set.
Mathematically: wj(new) = wj(old) - η * (∂/∂wj) J(w), with the gradient computed from one data-point at a time, for the jth weight parameter
Mini-Batch Gradient Descent: We update the weights after calculating the gradient of the loss of k data-points at a time, where k < m (the total number of data-points in our data-set).
Mathematically: wj(new) = wj(old) - η * (∂/∂wj) J(w), with the gradient summed over i ∈ [1, k], for the jth weight parameter
(To understand more about Gradient Descent, read the previous post on Linear Regression.)
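Putting the gradient and the update rule together, a batch gradient descent training loop might look like this minimal sketch (the function name, learning rate and epoch count are illustrative choices, not prescriptions):

```python
import numpy as np

def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    # X: (m, n) feature matrix, y: (m,) vector of 0/1 labels.
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # hw(x) for every data-point
        grad_w = X.T @ (p - y) / m              # (1/m) * sum((hw(x) - y) * xj)
        grad_b = np.mean(p - y)                 # same gradient for the bias term
        w -= lr * grad_w                        # w := w - eta * gradient
        b -= lr * grad_b
    return w, b
```

For a feature like SALARY, whose raw values are in the tens of thousands, scaling the inputs first (for example, dividing by the maximum salary) helps the loop converge.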
After training our model comes its evaluation.
What Metric is used for Evaluation?
We can use any one of the metrics Accuracy, Precision, Recall and F1 Score to evaluate our classification model.
Before this we need to compute the Confusion Matrix:
(here '1' and '0' denote the two different classes, as encoded earlier)
Here:
TP (True Positives): the number of observations having both the actual and predicted values '1'
TN (True Negatives): the number of observations having both the actual and predicted values '0'
FP (False Positives): the number of observations where the actual value was '0' but our model predicted '1'
FN (False Negatives): the number of observations where the actual value was '1' but our model predicted '0'
Depending on these values the metrics of evaluation are defined as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
It is the ratio of the correctly predicted observations to the total number of data-points in the data-set.
Precision = TP / (TP + FP)
It is the ratio of correct (true positive) predictions among all the positively predicted values.
Recall = TP / (TP + FN)
It is the ratio of the correct (true positive) observations to all the actual positive observations.
F1 Score = (2 * Precision * Recall) / (Precision + Recall)
Using any one of the above metrics we can evaluate our model's performance.
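These counts and metrics are easy to compute directly. A minimal sketch (assuming 0/1 numpy arrays and that every denominator is non-zero):

```python
import numpy as np

def evaluate(y_true, y_pred):
    # Confusion-matrix counts for a binary classifier with classes 1 and 0.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```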
Our classification models can also face the familiar problems of Under-Fitting and Over-Fitting.
But the solution to this problem is also the same: we can change our loss function to
J(w) = (1/m) ∑ loss(hw(x(i)), y(i)) + (λ/m) * (model complexity), summation over i ∈ [1, m]
loss(hw(x(i)), y(i)) = -y(i)*log(hw(x(i))) - (1-y(i))*log(1 - hw(x(i)))
Here λ is known as the regularization rate, and the model complexity term penalises the size of the model's parameters. It can be measured with the L1 or L2 norm of the weights, giving what are also known as Lasso and Ridge regularization respectively.
Increasing the value of λ does not allow the model to memorize the characteristics of the data-set, and so prevents Over-Fitting.
If the model is facing the problem of Under-Fitting, we can reduce the value of the regularization rate to overcome it. (The concepts of over-fitting, under-fitting and regularization are discussed in the Linear Regression post.)
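As a sketch of the L2 (Ridge) variant, the penalty is just the sum of squared weights added to the cross entropy term (following the λ/m scaling used above; some texts use λ/2m instead):

```python
import numpy as np

def regularized_loss(p, y, w, lam, eps=1e-12):
    # J(w) = (1/m) * sum(cross entropy) + (lambda/m) * sum(wj^2)
    m = y.shape[0]
    p = np.clip(p, eps, 1 - eps)
    data_loss = np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))
    return data_loss + (lam / m) * np.sum(w ** 2)
```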