Linear Regression

Joe Richard
5 min read · Dec 29, 2020

As we know, there are three types of learning in Machine Learning:

  1. supervised learning
  2. unsupervised learning
  3. reinforcement learning

Among these, Linear Regression comes under supervised learning.

So what exactly is Linear Regression?

As the name (Regression) suggests, we are going to predict real-number values as output.

Why should we use linear regression?

Consider a dataset where both the inputs and the output are real numbers. In that case, the Linear Regression algorithm works well compared to the other algorithms.

Before we get into this, let us learn about a few important parameters.

Weights: Weights are the parameters that denote the importance of each feature (column). They are denoted by various symbols; in Machine Learning the usual symbol is θ. Weights are also called the slope, and they decide the angle of the fitted line.

Example: consider that you are going to buy a mobile phone and security is the feature that matters most to you. Here "security" gets a higher weight than the other features.

So our model will favour phones with better security and suggest one to you (probably an iPhone).

Bias: The bias is the parameter that shifts the fitted line up or down, deciding where it crosses the axis. It is also called the intercept.

Hypothesis: The function the model uses to map inputs to predictions; it describes the functionality of the model.

Cost function: This is used to check how well our model predicts. It is denoted as J(θ) and measures the difference between the predicted values and the real values. For linear regression the cost function is simply the MEAN SQUARED ERROR.

Now let's get back to Linear Regression!

The main formula for a Linear Regression Model is

Y=c+mx

Here,

Y = hypothesis of the model, i.e. the predicted output

c = intercept of the model

m = slope of the model (and x is the feature from the dataset)

If there are two features in the dataset, the fitted model is a 2D plane. If there are more than two features, it is called a hyperplane.

Remember that having only one feature in the training data will give you a straight line.
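As a quick illustration, here is a minimal Python sketch of the hypothesis Y = c + m*x; the slope and intercept values below are made up purely for demonstration, not learned from data:

```python
import numpy as np

c = 1.0   # intercept (bias), assumed value for illustration
m = 0.5   # slope (weight), assumed value for illustration

x = np.array([1, 2, 4, 3, 5])   # a single feature

y_pred = c + m * x              # hypothesis: one prediction per input value
print(y_pred)                   # [1.5 2.  3.  2.5 3.5]
```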

Loss Function

The loss is the error that comes from our current values of m and c. Our goal is to minimize this error to obtain the most accurate values of m and c.
We will use the cost function (mean squared error) to calculate the loss. There are three steps in this function:

  1. Find the difference between the actual y and the predicted value ȳ (= mx + c), for a given x.
  2. Square this difference.
  3. Find the mean of the squares for every value in X.

Mean Squared Error Equation:

E = (1/n) Σᵢ (yᵢ - ȳᵢ)²

Here yᵢ is the actual value and ȳᵢ is the predicted value. Let's substitute the value of ȳᵢ = mxᵢ + c:

E = (1/n) Σᵢ (yᵢ - (mxᵢ + c))²

So we square the error and find the mean, hence the name Mean Squared Error. Now that we have defined the loss function, let's get into the interesting part: minimizing it and finding m and c.
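A short sketch of the three steps above, using the dataset from later in this article; the m and c values are assumptions chosen only for illustration:

```python
import numpy as np

x = np.array([1, 2, 4, 3, 5])
y = np.array([1, 3, 3, 2, 5])        # actual values

m, c = 0.5, 1.0                      # assumed slope and intercept for the example

y_pred = m * x + c                   # step 1: predicted value ȳ for each x
squared_diff = (y - y_pred) ** 2     # step 2: square the differences
mse = squared_diff.mean()            # step 3: take the mean of the squares
print(mse)                           # 0.75
```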

The Gradient Descent Algorithm

Gradient descent is an iterative optimization algorithm to find the minimum of a function. Here that function is our Loss Function.

Understanding Gradient Descent

[Image 1: the loss curve, with point A at the top (starting point) and point B at the bottom (convergence point)]

First we initialize the parameters with some small values. The reason for this is that unless we give initial values, our model cannot start learning at all. This way of initializing the model is also called the whole stock model.

Let's see an example for the learning rate using image 1 above. There are two friends, A and B. A is at the starting point and B is at the convergence point. A has to reach his friend B, so he walks down. The speed of his walking can be thought of as the learning rate. As we can see, the path from A to B is so steep that A has to walk slowly; if he walks too fast he will fall. That is why we use a very low learning rate.

Now let's see how Gradient Descent works.

Consider a dataset:

X Y
1 1
2 3
4 3
3 2
5 5

Initial model

θ0 = 0.0

θ1 = 0.0

learning rate = 0.01

p = θ0 + θ1*x

error = p - y

Iteration 1

X = 1, Y = 1 (as per dataset)

p = 0.0 + 0.0*1

p = 0

error = 0 - 1

error = -1

Rule for updated parameter: θ0 = θ0 - learning rate * error

= 0.0 - 0.01*-1

= 0.01

Rule for updated parameter: θ1 = θ1 - learning rate * error * X

= 0.0 - 0.01*-1*1

= 0.01

Iteration 2

X = 2, Y = 3 (as per dataset)

θ0 = 0.01

θ1 = 0.01

p = 0.01 + 0.01*2

p = 0.03

error = 0.03 - 3

error = -2.97

Rule for updated parameter: θ0 = θ0 - learning rate * error

= 0.01 - 0.01*-2.97

= 0.0397

Rule for updated parameter: θ1 = θ1 - learning rate * error * X

= 0.01 - 0.01*-2.97*2

= 0.0694

Iteration 3

X = 4, Y = 3 (as per dataset)

θ0 = 0.0397

θ1 = 0.0694

p = 0.0397 + 0.0694*4

p = 0.3173

error = 0.3173 - 3

error = -2.6827

Rule for updated parameter: θ0 = θ0 - learning rate * error

= 0.0397 - 0.01*-2.6827

= 0.066527

Rule for updated parameter: θ1 = θ1 - learning rate * error * X

= 0.0694 - 0.01*-2.6827*4

= 0.176708

So we must continue iterating until our model reaches the convergence point.

This is how values of θ0 and θ1 are updated.
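The same procedure can be written as a short Python loop. This is a minimal sketch (not a full training script) of stochastic gradient descent with a learning rate of 0.01; one pass over the dataset reproduces the three iterations worked out above:

```python
# Dataset from the table above
X = [1, 2, 4, 3, 5]
Y = [1, 3, 3, 2, 5]

theta0, theta1 = 0.0, 0.0   # initial model
learning_rate = 0.01

for epoch in range(1):          # one pass shown; in practice repeat until convergence
    for x, y in zip(X, Y):
        p = theta0 + theta1 * x                 # prediction
        error = p - y                           # predicted minus actual
        theta0 = theta0 - learning_rate * error
        theta1 = theta1 - learning_rate * error * x
        print(f"x={x}  error={error:.4f}  theta0={theta0:.6f}  theta1={theta1:.6f}")
```

Running it prints θ0 = 0.01 and θ1 = 0.01 after the first sample, 0.0397 and 0.0694 after the second, and 0.066527 and 0.176708 after the third, matching the hand calculations above.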

Gradient descent is one of the simplest and most widely used algorithms in machine learning, mainly because it can be applied to almost any differentiable function to optimize it. Learning it lays the foundation for mastering machine learning.

Got questions? Need help? Contact me!

Email: joe101richard@gmail.com

Instagram: joe___richard

Twitter: @JoeRichard101

