Multivariable regression:
A more complex,
multi-variable linear equation might look like this, where w represents the
coefficients, or weights, our model will try to learn.
f(x,y,z)=w1x+w2y+w3z
The variables x,y,z represent the attributes, or distinct pieces of
information, we have about each observation. For sales predictions, these
attributes might include a company’s advertising spend on radio, TV, and
newspapers.
Sales=w1Radio+w2TV+w3News
Let’s say we are given data on
TV, radio, and newspaper advertising spend for a list of companies, and our
goal is to predict sales in terms of units sold.
Company  | TV    | Radio | News | Units
---------|-------|-------|------|------
Amazon   | 230.1 | 37.8  | 69.1 | 22.1
Google   | 44.5  | 39.3  | 23.1 | 10.4
Facebook | 17.2  | 45.9  | 34.7 | 18.3
Apple    | 151.5 | 41.3  | 13.2 | 18.5
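To keep the following snippets concrete, here is one way to hold this data in NumPy (a sketch: only the four sample rows above are shown, while the dataset referenced later contains 200 companies).

import numpy as np

# Feature matrix: one row per company, columns are TV, Radio, News spend.
features = np.array([
    [230.1, 37.8, 69.1],   # Amazon
    [ 44.5, 39.3, 23.1],   # Google
    [ 17.2, 45.9, 34.7],   # Facebook
    [151.5, 41.3, 13.2],   # Apple
])

# Target vector: units sold for each company.
targets = np.array([
    [22.1],
    [10.4],
    [18.3],
    [18.5],
])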
Growing complexity
As the number of features grows, the complexity of our model increases and it becomes increasingly difficult to visualize, or even comprehend, our data.
One solution is to break the data apart and compare 1-2
features at a time. In this example we explore how Radio and TV investment
impacts Sales.
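A quick way to do this is a pair of scatter plots (a sketch, assuming matplotlib and the features/targets arrays defined above; the full 200-company dataset would make the trends clearer).

import matplotlib.pyplot as plt

# Compare one feature at a time against the target.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(features[:, 0], targets)
ax1.set_xlabel('TV spend')
ax1.set_ylabel('Units sold')
ax2.scatter(features[:, 1], targets)
ax2.set_xlabel('Radio spend')
ax2.set_ylabel('Units sold')
plt.show()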
Normalization
As the number of features grows, computing the gradient takes longer.
We can speed this up by “normalizing” our input data to
ensure all values are within the same range. This is especially important for
datasets with high standard deviations or differences in the ranges of the
attributes. Our goal now will be to normalize our features so they are all in
the range -1 to 1.
Algorithm:
For each feature column {
    #1 Subtract the mean of the column (mean normalization)
    #2 Divide by the range of the column (feature scaling)
}
Our input is a 200 x 3 matrix containing TV, Radio, and Newspaper data. Our output is a normalized matrix of the same shape with all values between -1 and 1.
Code
import numpy as np

def normalize(features):
    '''
    features - (200, 3)
    features.T - (3, 200)

    We transpose the input matrix, swapping
    cols and rows to make vector math easier
    '''
    for feature in features.T:
        fmean = np.mean(feature)
        frange = np.amax(feature) - np.amin(feature)

        # Vector subtraction
        feature -= fmean

        # Vector division
        feature /= frange

    return features
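As a quick sanity check (a sketch reusing the toy features array from earlier), we can normalize a float copy of the data and confirm every column now sits inside [-1, 1]. Note that normalize modifies its argument in place.

# .astype(float) hands normalize() a fresh float copy to modify in place.
features = normalize(features.astype(float))
print(features.min(axis=0))   # every column minimum is >= -1
print(features.max(axis=0))   # every column maximum is <= 1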
Our predict function outputs an estimate of sales given our
current weights (coefficients) and a company’s TV, radio, and newspaper spend.
Our model will try to identify weight values that most reduce our cost
function.
Sales=W1TV+W2Radio+W3Newspaper
def predict(features, weights):
    '''
    features - (200, 3)
    weights - (3, 1)
    predictions - (200, 1)
    '''
    predictions = np.dot(features, weights)
    return predictions
Initialize weights
W1 = 0.0
W2 = 0.0
W3 = 0.0
weights = np.array([
    [W1],
    [W2],
    [W3]
])
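With all-zero weights, a quick call (just a sketch, using the arrays defined earlier) confirms the shapes line up and that the untrained model predicts zero sales for every company.

predictions = predict(features, weights)
print(predictions.shape)   # (N, 1): one estimate per company
print(predictions[:2])     # all zeros before any training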
Cost function
Now we need a cost function to audit how our model is performing. The math is the same, except we swap the mx+b expression for W1x1+W2x2+W3x3. We also divide the expression by 2 to make derivative calculations simpler.
def cost_function(features, targets, weights):
    '''
    features: (200, 3)
    targets: (200, 1)
    weights: (3, 1)

    Returns average squared error among predictions
    '''
    N = len(targets)

    predictions = predict(features, weights)

    # Matrix math lets us do this without looping
    sq_error = (predictions - targets)**2

    # Return average squared error among predictions
    return 1.0/(2*N) * sq_error.sum()
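For example (a sketch, using the zero-initialized weights above), we can record the starting cost so we have a baseline that training should drive down.

# Baseline cost before any gradient descent steps.
initial_cost = cost_function(features, targets, weights)
print(initial_cost)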
Gradient descent
Again using the chain rule, we can compute the gradient: a vector of partial
derivatives describing the slope of the cost function with respect to each
weight.
f′(W1)=−x1(y−(W1x1+W2x2+W3x3))
f′(W2)=−x2(y−(W1x1+W2x2+W3x3))
f′(W3)=−x3(y−(W1x1+W2x2+W3x3))
def update_weights(features, targets, weights, lr):
    '''
    Features: (200, 3)
    Targets: (200, 1)
    Weights: (3, 1)
    '''
    predictions = predict(features, weights)

    # Extract our features
    x1 = features[:,0]
    x2 = features[:,1]
    x3 = features[:,2]

    # Use dot product to calculate the derivative for each weight
    d_w1 = -x1.dot(targets - predictions)
    d_w2 = -x2.dot(targets - predictions)
    d_w3 = -x3.dot(targets - predictions)

    # Multiply the mean derivative by the learning rate
    # and subtract from our weights (remember gradient points in direction of steepest ASCENT)
    weights[0][0] -= (lr * np.mean(d_w1))
    weights[1][0] -= (lr * np.mean(d_w2))
    weights[2][0] -= (lr * np.mean(d_w3))

    return weights
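A single call (a sketch; the learning-rate value here is only illustrative) nudges each weight in the direction opposite its partial derivative.

lr = 0.0005
weights = update_weights(features, targets, weights, lr)
print(weights)                                    # updated in place
print(cost_function(features, targets, weights))  # cost after one step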
And that’s it! Multivariate linear regression.
Simplifying with matrices
The gradient descent code above has a lot of duplication. Can we
improve it somehow? One way to refactor would be to loop through our features
and weights, allowing our function to handle any number of features. However,
there is an even better technique: vectorized gradient descent.
Math
We use the same formula as above, but instead of operating on a single feature
at a time, we use matrix multiplication to operate on all features and weights
simultaneously. We replace the individual xi terms with a single feature
matrix X.
gradient=−X(targets−predictions)
Code
X = [
    [x1, x2, x3],
    [x1, x2, x3],
    .
    .
    .
    [x1, x2, x3]
]

targets = [
    [1],
    [2],
    [3]
]
def update_weights_vectorized(X, targets, weights, lr):
    '''
    gradient = X.T * (predictions - targets) / N

    X: (200, 3)
    Targets: (200, 1)
    Weights: (3, 1)
    '''
    companies = len(X)

    #1 - Get Predictions
    predictions = predict(X, weights)

    #2 - Calculate error/loss
    error = targets - predictions

    #3 - Transpose features from (200, 3) to (3, 200)
    # So we can multiply with the (200, 1) error matrix.
    # Returns a (3, 1) matrix holding 3 partial derivatives --
    # one for each feature -- representing the aggregate
    # slope of the cost function across all observations
    gradient = np.dot(-X.T, error)

    #4 - Take the average error derivative for each feature
    gradient /= companies

    #5 - Multiply the gradient by our learning rate
    gradient *= lr

    #6 - Subtract from our weights to minimize cost
    weights -= gradient

    return weights
Bias term
Our train function is the same as for simple linear regression; however, we’re
going to make one final tweak before running: add a bias term to our feature
matrix.
In our example, it’s very unlikely that sales would be zero if
companies stopped advertising. Possible reasons for this might include past
advertising, existing customer relationships, retail locations, and
salespeople. A bias term will help us capture this base case.
Code
Below we add a column of constant 1s to our features matrix. Because this
feature is always 1, its weight acts as a constant bias term in our predictions.
bias = np.ones(shape=(len(features),1))
features = np.append(bias, features, axis=1)
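The training driver below is a minimal sketch (the train name, zero initialization, and logging interval are our own choices, not fixed by the text above). With the bias column prepended, the feature matrix becomes (200, 4), so the weight vector needs a fourth entry for the bias.

def train(features, targets, weights, lr, iters):
    cost_history = []
    for i in range(iters):
        weights = update_weights_vectorized(features, targets, weights, lr)
        cost = cost_function(features, targets, weights)
        cost_history.append(cost)
        if i % 100 == 0:
            print("iter: {}  cost: {:.3f}".format(i, cost))
    return weights, cost_history

# One extra weight for the bias column added above.
weights = np.zeros(shape=(4, 1))
weights, cost_history = train(features, targets, weights, lr=0.0005, iters=1000)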
Model evaluation
After training our model through 1000 iterations with a learning
rate of .0005, we finally arrive at a set of weights we can use to make predictions:
Sales=4.7TV+3.5Radio+.81Newspaper+13.9
Our MSE cost dropped from 110.86 to 6.25.