What is Linear Regression?
Linear Regression is a supervised machine learning
algorithm where the predicted output is continuous and has a constant slope.
It’s used to predict values within a continuous range (e.g. sales, price)
rather than trying to classify them into categories (e.g. cat, dog). There are
two main types:
Simple regression
Simple linear regression uses the traditional slope-intercept form:
y = mx + b
where m and b are the variables our algorithm will try to “learn” to produce the most accurate predictions, x represents our input data, and y represents our prediction.
Multivariable regression
A more complex, multivariable linear equation might look like this:
f(x, y, z) = w1x + w2y + w3z
where w1, w2, and w3 represent the coefficients, or weights, our model will try to learn.
The variables x, y, z represent the
attributes, or distinct pieces of information, we have about each observation.
For sales predictions, these attributes might include a company’s
advertising spend on radio, TV, and newspapers.
Sales = w1⋅Radio + w2⋅TV + w3⋅News
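As a minimal sketch of how such a multivariable prediction looks in code (this helper function and its names are our own illustration, not part of any library):

def predict(features, weights):
    # Weighted sum of the feature values, e.g. features = [radio, tv, news]
    total = 0.0
    for x, w in zip(features, weights):
        total += w * x
    return total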
Simple regression:
Let’s say we are given a dataset with the following columns (features): how much a
company spends on Radio advertising each year and its annual Sales in terms of
units sold. We are trying to develop an equation that will let us predict
units sold based on how much a company spends on radio advertising. The rows (observations)
represent companies.
Company  | Radio ($) | Sales (units)
Amazon   | 37.8      | 22.1
Google   | 39.3      | 10.4
Facebook | 45.9      | 18.3
Apple    | 41.3      | 18.5
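To keep the later code examples concrete, we can hold this sample in two plain Python lists, one entry per company. The variable names radio and sales are our own choice and are reused in the snippets below:

# Radio advertising spend ($) and units sold, one entry per company
radio = [37.8, 39.3, 45.9, 41.3]
sales = [22.1, 10.4, 18.3, 18.5]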
Making predictions
Our prediction function outputs an estimate of sales given a company’s radio advertising spend and our current values for Weight and Bias:
Sales = Weight⋅Radio + Bias
Weight: the coefficient for the Radio independent variable. In machine learning we call coefficients weights.
Radio: the independent variable. In machine learning we call these variables features.
Bias: the intercept, where our line intercepts the y-axis. In machine learning we call intercepts bias. Bias offsets all predictions that we make.
Our
algorithm will try to learn the correct values for Weight and Bias. By the end of
our training, our equation will approximate the line of best fit.
Code
def predict_sales(radio, weight, bias):
    # Estimate units sold from radio spend with the current weight and bias
    return weight*radio + bias
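For example, with an illustrative weight of 0.5 and bias of 1.0 (values picked only for demonstration, not learned):

predict_sales(37.8, 0.5, 1.0)  # 0.5*37.8 + 1.0 = 19.9 predicted units for Amazon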
Cost function:
The prediction function is nice, but on its own it doesn’t help us train the model. What we need is a cost function so we can start optimizing our weights.
Let’s
use MSE (L2) as our cost function. MSE measures the average squared
difference between an observation’s actual and predicted values. The output is
a single number representing the cost, or score, associated with our current
set of weights. Our goal is to minimize MSE to improve the accuracy of our
model.
Math
Given our simple linear equation y = mx + b, we can calculate MSE as:
MSE = 1/N ⋅ Σ (y_i − (m⋅x_i + b))²
where
· N is the total number of observations (data points)
· y_i is the actual value of an observation and m⋅x_i + b is our prediction for it
Code
def cost_function(radio, sales, weight, bias):
companies = len(radio)
total_error = 0.0
for i in range(companies):
total_error += (sales[i] - (weight*radio[i] + bias))**2
return total_error / companies
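With the four-company sample above and an untrained model (weight and bias both zero), every prediction is 0, so the cost is just the mean of the squared sales values:

cost_function(radio, sales, 0.0, 0.0)  # (22.1**2 + 10.4**2 + 18.3**2 + 18.5**2) / 4 ≈ 318.43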
Gradient descent:
To minimize MSE we use Gradient Descent to calculate the gradient of our cost function. Gradient descent consists of looking at the error our current weight gives us, using the derivative of the cost function to find the gradient (the slope of the cost function at our current weight), and then updating our weight in the direction opposite the gradient. We move opposite the gradient because it points up the slope rather than down it, and we want to decrease our error.
Math
There are two parameters in our cost function we can control: weight m and bias b. Since we need to consider the impact each one has on the final prediction, we use partial derivatives. To find the partial derivatives, we use the chain rule. We need the chain rule because (y − (mx + b))² is really two nested functions: the inner function y − (mx + b) and the outer function u².
Returning to our cost function:
f(m, b) = 1/N ⋅ Σ (y_i − (m⋅x_i + b))²
Taking the partial derivative with respect to each parameter gives us the gradient:
df/dm = 1/N ⋅ Σ −2⋅x_i⋅(y_i − (m⋅x_i + b))
df/db = 1/N ⋅ Σ −2⋅(y_i − (m⋅x_i + b))
To solve for the gradient, we iterate through our data points using our current weight and bias values and take the average of the partial derivatives. The resulting gradient tells us the slope of our cost function at our current position (i.e. weight and bias) and the direction we should update to reduce our cost function (we move in the direction opposite the gradient). The size of our update is controlled by the learning rate.
Code
def update_weights(radio, sales, weight, bias, learning_rate):
    weight_deriv = 0
    bias_deriv = 0
    companies = len(radio)
    for i in range(companies):
        # Calculate partial derivatives
        # -2x(y - (mx + b))
        weight_deriv += -2*radio[i] * (sales[i] - (weight*radio[i] + bias))
        # -2(y - (mx + b))
        bias_deriv += -2*(sales[i] - (weight*radio[i] + bias))
    # We subtract because the derivatives point in direction of steepest ascent
    weight -= (weight_deriv / companies) * learning_rate
    bias -= (bias_deriv / companies) * learning_rate
    return weight, bias
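One update step on the sample data, starting from zero with an illustrative learning rate of 0.0001, looks like this:

weight, bias = update_weights(radio, sales, 0.0, 0.0, 0.0001)
# On this four-company sample: weight ≈ 0.1424, bias ≈ 0.0035,
# both nudged in the direction that reduces MSE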
Training:
Training a model is the process of iteratively improving
your prediction equation by looping through the dataset multiple times, each
time updating the weight and bias values in the direction indicated by the
slope of the cost function (gradient). Training is complete when we reach an
acceptable error threshold, or when subsequent training iterations fail to
reduce our cost.
Before training we need to initialize our weights (set
default values), set our hyperparameters (learning
rate and number of iterations), and prepare to log our progress over each
iteration.
Code
def train(radio, sales, weight, bias, learning_rate, iters):
    cost_history = []
    for i in range(iters):
        weight, bias = update_weights(radio, sales, weight, bias, learning_rate)
        # Calculate cost for auditing purposes
        cost = cost_function(radio, sales, weight, bias)
        cost_history.append(cost)
        # Log progress every 10 iterations
        if i % 10 == 0:
            print("iter={:d} weight={:.2f} bias={:.4f} cost={:.2f}".format(i, weight, bias, cost))
    return weight, bias, cost_history
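A run on the sample data might look like this (the learning rate of 0.0001 and 100 iterations are illustrative hyperparameters, not the values behind the log below):

weight, bias, cost_history = train(radio, sales, 0.0, 0.0, 0.0001, 100)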
Model evaluation:
If
our model is working, we should see our cost decrease after every iteration.
Logging
iter=1  weight=.03 bias=.0014 cost=197.25
iter=10 weight=.28 bias=.0116 cost=74.65
iter=20 weight=.39 bias=.0177 cost=49.48
iter=30 weight=.44 bias=.0219 cost=44.31
iter=40 weight=.46 bias=.0249 cost=43.28
Visualizing
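The simplest thing to visualize is the cost history that train returns. A minimal sketch with matplotlib, assuming the cost_history list from the training run above:

import matplotlib.pyplot as plt

# Cost per iteration; a working model shows a steadily decreasing curve
plt.plot(cost_history)
plt.xlabel("Iteration")
plt.ylabel("MSE cost")
plt.title("Cost over training iterations")
plt.show()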
Summary
By
learning the best values for weight (.46) and bias (.25), we now have an
equation that predicts future sales based on radio advertising investment.
How
would our model perform in the real world? I’ll let you think about it :)