
Multivariable Regression in Machine Learning

 

Multivariable regression:

A more complex, multi-variable linear equation might look like this, where w represents the coefficients, or weights, our model will try to learn.

 f(x,y,z)=w1x+w2y+w3z

The variables x,y,z represent the attributes, or distinct pieces of information, we have about each observation. For sales predictions, these attributes might include a company’s advertising spend on radio, TV, and newspapers.

Sales=w1Radio+w2TV+w3News

Let’s say we are given data on TV, radio, and newspaper advertising spend for a list of companies, and our goal is to predict sales in terms of units sold.

Company     TV      Radio   News    Units
Amazon      230.1   37.8    69.1    22.1
Google      44.5    39.3    23.1    10.4
Facebook    17.2    45.9    34.7    18.3
Apple       151.5   41.3    13.2    18.5
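
To follow along in code, the sample rows above can be dropped straight into NumPy arrays. This is only a sketch: the full dataset referenced later has 200 companies, so these four rows are just the slice shown in the table.

import numpy as np

# Advertising spend (TV, Radio, News), one row per company
features = np.array([
    [230.1, 37.8, 69.1],   # Amazon
    [ 44.5, 39.3, 23.1],   # Google
    [ 17.2, 45.9, 34.7],   # Facebook
    [151.5, 41.3, 13.2],   # Apple
])

# Units sold for each company (our target)
targets = np.array([
    [22.1],
    [10.4],
    [18.3],
    [18.5],
])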

 

Growing complexity

As the number of features grows, the complexity of our model increases and it becomes increasingly difficult to visualize, or even comprehend, our data.



One solution is to break the data apart and compare one or two features at a time. In this example we explore how Radio and TV investment impacts Sales.
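
For instance, a couple of scatter plots make these pairwise relationships easy to inspect. This is only a sketch: it assumes matplotlib is available and uses the features/targets arrays built from the table above.

import matplotlib.pyplot as plt

units = targets.ravel()

# Sales against Radio spend and against TV spend, side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(features[:, 1], units)
ax1.set_xlabel('Radio')
ax1.set_ylabel('Units')
ax2.scatter(features[:, 0], units)
ax2.set_xlabel('TV')
ax2.set_ylabel('Units')
plt.show()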

Normalization

As the number of features grows, the gradient takes longer to compute. We can speed this up by "normalizing" our input data so that all values fall within the same range. This is especially important for datasets with high standard deviations or large differences in the ranges of their attributes. Our goal now will be to normalize our features so they are all in the range -1 to 1.

Algorithm:
 
For each feature column {
    #1 Subtract the mean of the column (mean normalization)
    #2 Divide by the range of the column (feature scaling)
}
Our input is a 200 x 3 matrix containing TV, Radio, and Newspaper data. Our output is a normalized matrix of the same shape with all values between -1 and 1.
 
Code
def normalize(features):
    '''
    features     -   (200, 3)
    features.T   -   (3, 200)
    We transpose the input matrix, swapping
    cols and rows to make vector math easier
    '''
    for feature in features.T:
        fmean = np.mean(feature)
        frange = np.amax(feature) - np.amin(feature)
        #Vector Subtraction
        feature -= fmean
        #Vector Division
        feature /= frange
    return features
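
A quick usage sketch. Note that normalize() modifies its argument in place through views of features.T, so the input should be a float array:

# Sketch of usage: features must be a float array, otherwise the
# in-place division inside normalize() fails on integer dtypes
features = normalize(features.astype(float))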
 

 Making predictions

Our predict function outputs an estimate of sales given our current weights (coefficients) and a company’s TV, radio, and newspaper spend. Our model will try to identify weight values that most reduce our cost function.

Sales=W1TV+W2Radio+W3Newspaper

def predict(features, weights):
  '''
  features - (200, 3)
  weights - (3, 1)
  predictions - (200,1)
  '''
  predictions = np.dot(features, weights)
  return predictions

 

Initialize weights

W1 = 0.0
W2 = 0.0
W3 = 0.0
weights = np.array([
    [W1],
    [W2],
    [W3]
])

 

Cost function

Now we need a cost function to audit how our model is performing. The math is the same, except we swap the mx+b expression for W1x1+W2x2+W3x3. We also divide the expression by 2 to make derivative calculations simpler.
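
Written out in the same notation as the formulas above, the cost we minimize is the average squared error over all N companies, halved as described:

MSE = 1/(2N) ∑(y−(W1x1+W2x2+W3x3))²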




def cost_function(features, targets, weights):
    '''
    features:(200,3)
    targets: (200,1)
    weights:(3,1)
    returns average squared error among predictions
    '''
    N = len(targets)
 
    predictions = predict(features, weights)
 
    # Matrix math lets us do this without looping
    sq_error = (predictions - targets)**2
 
    # Return average squared error among predictions
    return 1.0/(2*N) * sq_error.sum()

 

Gradient descent

Again using the chain rule we can compute the gradient: a vector of partial derivatives describing the slope of the cost function for each weight.

f'(W1)=−x1(y−(W1x1+W2x2+W3x3))

f'(W2)=−x2(y−(W1x1+W2x2+W3x3))

f'(W3)=−x3(y−(W1x1+W2x2+W3x3))

def update_weights(features, targets, weights, lr):
    '''
    Features:(200, 3)
    Targets: (200, 1)
    Weights:(3, 1)
    '''
    predictions = predict(features, weights)
 
    #Extract our features
    x1 = features[:,0]
    x2 = features[:,1]
    x3 = features[:,2]
 
    # Use dot product to calculate the derivative for each weight
    d_w1 = -x1.dot(targets - predictions)
    d_w2 = -x2.dot(targets - predictions)
    d_w3 = -x3.dot(targets - predictions)
 
    # Multiply the mean derivative by the learning rate
    # and subtract from our weights (remember gradient points in direction of steepest ASCENT)
    weights[0][0] -= (lr * np.mean(d_w1))
    weights[1][0] -= (lr * np.mean(d_w2))
    weights[2][0] -= (lr * np.mean(d_w3))
 
    return weights

And that’s it! Multivariate linear regression.

Simplifying with matrices

The gradient descent code above has a lot of duplication. Can we improve it somehow? One way to refactor would be to loop through our features and weights, allowing our function to handle any number of features. However, there is another, even better technique: vectorized gradient descent.

Math

We use the same formula as above, but instead of operating on a single feature at a time, we use matrix multiplication to operate on all features and weights simultaneously. We replace the xi terms with a single feature matrix X.

gradient=−X(targets−predictions)

Code

X = [
    [x1, x2, x3]
    [x1, x2, x3]
    .
    .
    .
    [x1, x2, x3]
]
 
targets = [
    [1],
    [2],
    [3]
]
 
def update_weights_vectorized(X, targets, weights, lr):
    '''
    gradient = X.T * (predictions - targets) / N
    X: (200, 3)
    Targets: (200, 1)
    Weights: (3, 1)
    '''
    companies = len(X)
 
    #1 - Get Predictions
    predictions = predict(X, weights)
 
    #2 - Calculate error/loss
    error = targets - predictions
 
    #3 Transpose features from (200, 3) to (3, 200)
    # so we can multiply it with the (200,1) error matrix.
    # Returns a (3,1) matrix holding 3 partial derivatives --
    # one for each feature -- representing the aggregate
    # slope of the cost function across all observations
    gradient = np.dot(-X.T,  error)
 
    #4 Take the average error derivative for each feature
    gradient /= companies
 
    #5 - Multiply the gradient by our learning rate
    gradient *= lr
 
    #6 - Subtract from our weights to minimize cost
    weights -= gradient
 
    return weights

Bias term

Our train function is the same as for simple linear regression; however, we're going to make one final tweak before running it: add a bias term to our feature matrix.

In our example, it’s very unlikely that sales would be zero if companies stopped advertising. Possible reasons for this might include past advertising, existing customer relationships, retail locations, and salespeople. A bias term will help us capture this base case.

Code

Below we prepend a column of 1s to our features matrix. Because this column is always 1, the corresponding weight acts as a constant bias term. Our features matrix becomes (200, 4), so the weights vector needs a fourth entry as well.

bias = np.ones(shape=(len(features),1))
features = np.append(bias, features, axis=1)
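
The train loop itself isn't reproduced here. A minimal sketch, assuming a hypothetical train helper that simply alternates update_weights_vectorized and cost_function (the learning rate and iteration count match the evaluation below):

def train(features, targets, weights, lr, iters):
    # Repeatedly update the weights and record the cost so we can
    # watch it fall as the model learns
    cost_history = []
    for i in range(iters):
        weights = update_weights_vectorized(features, targets, weights, lr)
        cost = cost_function(features, targets, weights)
        cost_history.append(cost)
        if i % 100 == 0:
            print("iter: " + str(i) + " cost: " + str(cost))
    return weights, cost_history

# With the bias column added, features is (200, 4), so we need 4 weights
weights = np.zeros((4, 1))
weights, cost_history = train(features, targets, weights, lr=0.0005, iters=1000)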

Model evaluation

After training our model through 1000 iterations with a learning rate of .0005, we finally arrive at a set of weights we can use to make predictions:

Sales=4.7TV+3.5Radio+.81Newspaper+13.9

Our MSE cost dropped from 110.86 to 6.25.
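
That drop can be visualized by plotting the cost_history list from the training sketch above (again assuming matplotlib):

import matplotlib.pyplot as plt

# Plot MSE cost against training iteration to confirm it is decreasing
plt.plot(cost_history)
plt.xlabel('Iteration')
plt.ylabel('MSE cost')
plt.title('Training progress')
plt.show()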
