Linear Regression

Whenever we talk about a Machine Learning algorithm, we can frame the discussion around these five main questions:

  • What type of Training Data is used?
  • What Mathematical Function is used?
  • Which Loss Function is used?
  • How is the Model Trained?
  • What Metric is used for evaluation?
Everything in the Machine Learning universe revolves around these five questions. There are many popular machine learning algorithms in common use; Regression, Classification, and Clustering are some of them.
In this post we will be discussing the Regression Algorithm.

Regression 

Regression is a type of Supervised Learning Algorithm, where we have a set of Feature variables and a corresponding Continuous response variable. The task is to find some relationship between these two and try to predict the continuous response variable for every new set of feature variables.
There are various Regression algorithms, namely Linear Regression, Support Vector Regression, Decision Tree Regression, and Random Forest Regression. In this blog post we will be discussing Linear Regression, the most basic Machine Learning algorithm.

Linear Regression  

We will study the model using the questions mentioned above and check whether they really explain it all or not. So let's find out!

What type of Training Data is Used?

The question should really be: which machine learning model should be used after studying the data-set? The type of model used depends largely on the type of data (the features/variables of the data-set). But to think about it, what would have come first, a data-set or the model? What do you think? Let me know in the comments.

Now, if the data-set has some features, a continuous variable that depends on them, and the task is to predict that continuous variable, then Regression is what we should use.
For example, consider a company data-set whose variables are Work Experience and Salary.
If we take Work Experience as the independent variable and Salary as the dependent variable, plotting the data points in feature space looks like this:
Regression Data points in feature space
As we can see, there is a linear relationship between the variables.
On such data-sets we use Regression.

What Mathematical Function is used?

As we saw earlier, the features show a linear relation, so we can write the model equation as

hω,b(x) = b + ω1x1 + ω2x2 + ω3x3 + ... + ωmxm

b = the bias parameter
ωi = the weight parameters (i = 1, 2, 3, ..., m)
xi = the independent variables (i = 1, 2, 3, ..., m)

In the case of our example, we have only one independent variable, so our model equation reduces to
hω,b(x) = b + ω1x1
which represents a straight line in the feature plane. Plotting the model line onto the feature plane looks like this:
Linear Regression model
Here the red points represent the actual values and the blue line is our predicted model.
We can clearly see that there is error in our prediction. In Machine Learning we look for the solution with the minimum error. Practically, Salary (the dependent variable) is not a function of Years of Experience (the independent variable) alone; there are many other factors which may vary depending on the company.
Obtaining a model with zero training error is not really possible in Machine Learning unless our model over-fits the Training data, which is also not good. We aim to find the Just Right solution (explained later in the post).
So we aim for the best possible solution, i.e. minimum error.
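To make the hypothesis concrete, here is a minimal NumPy sketch of hω,b(x); the array names and the salary numbers are purely illustrative, not taken from a real data-set:

```python
import numpy as np

def predict(X, w, b):
    """Linear hypothesis h_{w,b}(x) = b + w1*x1 + ... + wm*xm, evaluated for each row of X."""
    # X has shape (n_points, n_features), w has shape (n_features,), b is a scalar
    return X @ w + b

# Illustrative one-feature case: Work Experience (years) -> Salary
X = np.array([[1.0], [3.0], [5.0], [10.0]])
w = np.array([4000.0])    # assumed weight, for illustration only
b = 25000.0               # assumed bias, for illustration only
print(predict(X, w, b))   # predicted salaries for the four points
```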

What is the Loss Function?

Observing our model, we see that the actual values and the predicted values do not match. The error is the difference between the actual value and the predicted value.
Consider 'yi' to be our actual value and 'hω,b(xi)' our predicted value. One more important thing to observe: some observations have a higher predicted value and some a lower predicted value compared to the actual values, so the differences are squared to keep positive and negative errors from cancelling. Taking this into account, we write the loss function for Linear Regression as:

J(ω,b) = (1/2) ∑ [hω,b(xi) - yi]²

J(ω,b) = (1/2) ∑ [(b + ω1xi) - yi]²                            ; summation over 'i' (i.e. the data points), for our one-feature example

As ω and b are the parameters of the model, the loss function depends on these values.
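As a small sketch, this loss can be computed with NumPy as below; the salary figures are made up purely for illustration:

```python
import numpy as np

def loss(X, y, w, b):
    """J(w, b) = (1/2) * sum_i (h_{w,b}(x_i) - y_i)^2"""
    residuals = X @ w + b - y           # predicted minus actual, for every point
    return 0.5 * np.sum(residuals ** 2)

# Illustrative values (same made-up salary example as before)
X = np.array([[1.0], [3.0], [5.0], [10.0]])
y = np.array([30000.0, 36000.0, 45000.0, 65000.0])
print(loss(X, y, np.array([4000.0]), 25000.0))
```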
The loss function J(ω,b) plotted against ω and b:
Here the z-axis represents the loss J(ω,b), and the x-axis and y-axis represent the parameters ω and b.
Loss Parameter Curve 3D
Regression Model curves

Each point on the Loss Parameter Graph represents a unique set of parameter values (ω,b), which gives a unique model equation. This is the duality between the loss space and the feature space.
The best model would be the one whose parameter values are represented by the bottom tip of the loss function graph, as it clearly gives the minimum loss value.
These values of ω and b are called the optimal values.

Now the question arises: how do we get these optimal values, such that the loss function is minimised?
This is where Optimisation Algorithms come into action, and it is also time to answer our next question.

How is the Model Trained? 

Finding the optimal values of the parameters through an iterative process is called Training our model. This is done by an optimisation algorithm.

Optimisation Algorithm:- 

The most common one is Gradient Descent; it is a widely used optimisation algorithm.

For simplicity, let's consider our bias parameter 'b' to be 0. Now our loss vs ω vs b graph reduces to a parabola in 2D space, as shown in the figure below.

Loss Parameter Curve 2D, Gradient Descent
Consider the initial point 'i' as displayed in the image: we initialise the weight parameter to some non-zero value and, using it, compute a loss with the loss function. Our aim is to reach the final point 'f'. How do we get there?
The optimisation method works like this: we calculate the gradient at the current point; if the gradient/slope at that point is negative (as in our case) we move in the positive ω direction, and if the slope is positive we move in the negative ω direction. How far we move in either direction is decided by the Learning Rate 'α', which sets the length of the stride. The new parameter value is then calculated as:

ω(new)i = ω(old)i - α * gradient

Repeating this process for some iterations, our final destination is reached.
Now a query may arise: what happens if we repeat this process even further?

It's simple!

At point 'f' the gradient becomes zero and our equation becomes ω(new)i = ω(old)i; in other words, our equation converges.
The Learning Rate (α): it is a Hyper-parameter (a value decided by the user) and plays a very important role in the learning process. To see this, we plot the graph of Loss vs No. of Iterations:
Loss vs Iteration Curve, Learning Rate
This graph shows the variation of the loss with the number of iterations for a specific value of α. The smaller the value, the longer the learning process, since a large number of iterations is required to reach the optimal parameter values. A very large value creates the possibility of jumping to the other side of the Loss Parameter curve instead of descending downwards. This value should be chosen wisely.
This is how the Gradient Descent Algorithm works. True to its name, it is like walking down (descending) a hill.
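Here is a minimal sketch of this descent for the simplified case above (b fixed at 0, a single feature); the data, learning rate, and iteration count are arbitrary choices made only for illustration:

```python
import numpy as np

# With b = 0 and one feature, J(w) = 0.5 * sum((w*x_i - y_i)^2) is a parabola in w.
x = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative independent variable
y = np.array([2.1, 3.9, 6.2, 7.8])   # illustrative dependent variable (roughly y = 2x)

alpha = 0.01     # learning rate, a hyper-parameter chosen by the user
w = 5.0          # initial point 'i': some non-zero starting value

for step in range(100):
    gradient = np.sum((w * x - y) * x)   # dJ/dw at the current point
    w = w - alpha * gradient             # w(new) = w(old) - alpha * gradient

print(w)  # close to the optimal weight (about 2 for this data)
```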

Calculating the Gradient:

Gradient = (∂/∂ωj) J(ω,b)

J(ω,b) = (1/2) ∑ [hω,b(xi) - yi]²                              ; summation over 'i', the data points

(∂/∂ωj) J(ω,b) = (∂/∂ωj) (1/2) ∑ [hω,b(xi) - yi]²              ; j = parameter index (1, 2, ..., m)
               = ∑ (hω,b(xi) - yi) * [(∂/∂ωj) hω,b(xi)]        ; hω,b(xi) = b + ω1x(1)i + ω2x(2)i + ... + ωmx(m)i
               = ∑ (hω,b(xi) - yi) * x(j)i                     ; x(j)i is the j-th feature of the i-th point

(∂/∂ωj) J(ω,b) = ∑ (hω,b(xi) - yi) * x(j)i                     ; summation over 'i', the data points ; j ∈ [1, m]

If we sum over all n points in the data-set to calculate the gradient and then update the weights, this is called Batch Gradient Descent.
(∂/∂ωj) J(ω,b) = ∑ (hω,b(xi) - yi) * x(j)i                     ; summation over i ∈ [1, n] ; j ∈ [1, m]
(∂/∂b) J(ω,b) = ∑ (hω,b(xi) - yi)

If we compute the gradient and update the weights after each data point, the method is called Stochastic Gradient Descent.
(∂/∂ωj) J(ω,b) = (hω,b(xi) - yi) * x(j)i                       ; for a single point i ∈ [1, n] ; j ∈ [1, m]
(∂/∂b) J(ω,b) = (hω,b(xi) - yi)

If we compute the gradient over k data points (k < n) and then update the weights, this is known as Mini-Batch Gradient Descent.
(∂/∂ωj) J(ω,b) = ∑ (hω,b(xi) - yi) * x(j)i                     ; summation over i ∈ [1, k], k < n ; j ∈ [1, m]
(∂/∂b) J(ω,b) = ∑ (hω,b(xi) - yi)
Generally the value of k used is a power of 2 (2^a, where a is an integer). A code sketch of all three variants follows below.
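Putting the gradients above into an update loop, here is one possible NumPy sketch of all three variants; the function name, default values, and shuffling strategy are my own illustrative choices, not a fixed recipe:

```python
import numpy as np

def train(X, y, alpha=0.001, n_iters=200, batch_size=None, seed=0):
    """Gradient descent for linear regression.

    batch_size=None      -> Batch Gradient Descent (all n points per update)
    batch_size=1         -> Stochastic Gradient Descent (update after each point)
    1 < batch_size < n   -> Mini-Batch Gradient Descent (k is usually a power of 2)
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    k = n if batch_size is None else batch_size
    for _ in range(n_iters):
        order = rng.permutation(n)                    # visit the points in a shuffled order
        for start in range(0, n, k):
            rows = order[start:start + k]
            residuals = X[rows] @ w + b - y[rows]     # h_{w,b}(x_i) - y_i over the batch
            w -= alpha * (X[rows].T @ residuals)      # sum_i (h(x_i) - y_i) * x(j)i for each j
            b -= alpha * np.sum(residuals)            # sum_i (h(x_i) - y_i)
    return w, b
```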

This is how we Train a model.

What Metric is Used for Evaluation?

Standard methods of evaluation in Regression are:
  • Mean Squared Error (MSE): the average of the squared errors over the n points. This is the measure we have effectively used in our example (our loss is the sum of squared errors, up to a constant factor).
  • Mean Absolute Error (MAE): the average of the absolute differences between the actual and predicted values. This is also sometimes used to evaluate performance.
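A small sketch of both metrics in NumPy, with made-up numbers just to show the calculation:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)    # average of the squared errors

def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))   # average of the absolute errors

y_true = np.array([3.0, 5.0, 7.0])            # illustrative actual values
y_pred = np.array([2.5, 5.5, 8.0])            # illustrative predicted values
print(mean_squared_error(y_true, y_pred))     # 0.5
print(mean_absolute_error(y_true, y_pred))    # about 0.667
```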

What Happens after Training the Model?

There are three possibilities; the model can be:
  • Under-fit
  • Just Right Fit
  • Over-fit
Underfitting, Overfitting, and Just Right Curves for Linear Regression
Loss vs Model Complexity Curve

  • Under-fitting is the situation where our model has not learned anything from our data and is incapable of giving satisfactory results. It occurs when we obtain a very high Training Loss and a high Validation Loss.

  • Over-fitting occurs when our model memorises all the properties of the Training Data and predicts the values of the Training Data with very little error, but fails to perform on the validation data and produces a very high Validation Loss.

  • Models with a Just Right Fit produce good, desirable results on the Training Data as well as on the Validation Data. Such models have low Training Loss as well as low Validation Loss.

Model Complexity here refers to the number of features/parameters used by our model.

To Fix the Under-fitting and Over-fitting Problems

  • From the Loss vs Model Complexity graph we can see that if the model is under-fitting, we can fix the problem by increasing the model complexity. This can be done by performing either a Feature Cross or a Polynomial Expansion (a code sketch of this expansion follows after this list).
    • Feature Cross - We make additional features by multiplying the existing ones. For example, with x1 and x2 as our original features, a Feature Cross gives one additional feature, x1*x2.
    • Polynomial Expansion - We use polynomial terms of the existing features. For example, with x1 and x2 as our original features, a Polynomial Expansion gives x1*x1, x2*x2, and x1*x2. This Regression is called Polynomial Regression. But even though the name says polynomial and it uses polynomial features, this does not make it a Non-Linear Regression; it remains Linear Regression, as its model equation is still a linear combination of the features: hω,b(x) = b + ω1x1 + ω2x2 + ω3x1*x1 + ω4x2*x2 + ω5x1*x2
  • To fix Over-fitting, we can also try to increase the amount of Training data.
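As a sketch of what a degree-2 expansion of two features looks like (the helper function below is just for illustration; libraries such as scikit-learn provide similar transforms):

```python
import numpy as np

def expand_features(X):
    """Degree-2 polynomial expansion, including the feature cross x1*x2, for two features."""
    x1, x2 = X[:, 0], X[:, 1]
    # original features, their squares, and the cross term
    return np.column_stack([x1, x2, x1 * x1, x2 * x2, x1 * x2])

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(expand_features(X))
# Fitting a linear model on these expanded columns is still Linear Regression:
# the hypothesis remains a linear combination of the (new) features.
```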
Another way to fix Over-Fitting and Under-Fitting is by Regularization.

  • We make a slight change in the loss function: J(ω,b) + λ * (model complexity). Here λ is known as the Regularization Rate (a Hyper-parameter whose value is decided by the user). The model complexity term can be measured in two ways:
    • L1 Regularization - Σ |ωi|, summation over 'i' (the parameters). It is the sum of the absolute values of all parameter weights. The Regression model using L1 Regularization is called Lasso Regression.
    • L2 Regularization - Σ ωi², summation over 'i' (the parameters). It is the sum of the squares of all parameter weights. The Regression model using L2 Regularization is called Ridge Regression.
What we are doing is giving some importance to the model complexity by assigning it a weight. If our model is over-fitting, we increase the value of λ, which stops the parameter weights from growing large enough to memorise the training data. If the model is under-fitting, we can reduce the value of λ. This way we can address both the Under-fitting and Over-fitting problems. A sketch of the regularized loss follows below.
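A sketch of the regularized loss, assuming the same squared-error loss used earlier; the function and argument names are illustrative (for real use, scikit-learn's Ridge and Lasso implement the L2 and L1 versions):

```python
import numpy as np

def regularized_loss(X, y, w, b, lam, kind="l2"):
    """J(w, b) + lambda * model_complexity, with the complexity term from L1 or L2."""
    residuals = X @ w + b - y
    data_loss = 0.5 * np.sum(residuals ** 2)     # the original J(w, b)
    if kind == "l1":
        penalty = np.sum(np.abs(w))              # L1: sum of absolute weights (Lasso)
    else:
        penalty = np.sum(w ** 2)                 # L2: sum of squared weights (Ridge)
    return data_loss + lam * penalty
```

Increasing lam penalises large weights more strongly, which is exactly the knob described above for fighting over-fitting.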
