Multivariable regression:
A more complex,
multi-variable linear equation might look like this, where w represents the
coefficients, or weights, our model will try to learn.
f(x,y,z)=w1x+w2y+w3z
The variables x,y,z represent the attributes, or distinct pieces of
information, we have about each observation. For sales predictions, these
attributes might include a company’s advertising spend on radio, TV, and
newspapers.
Sales=w1Radio+w2TV+w3News
Let’s say we are given data on
TV, radio, and newspaper advertising spend for a list of companies, and our
goal is to predict sales in terms of units sold.
Company  | TV    | Radio | News | Units
---------|-------|-------|------|------
Amazon   | 230.1 | 37.8  | 69.1 | 22.1
Google   | 44.5  | 39.3  | 23.1 | 10.4
Facebook | 17.2  | 45.9  | 34.7 | 18.3
Apple    | 151.5 | 41.3  | 13.2 | 18.5
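To keep the following snippets concrete, here is one way to hold this data in NumPy (a sketch: only the four sample rows above are shown, while the dataset referenced later contains 200 companies).

import numpy as np

# Feature matrix: one row per company, columns are TV, Radio, News spend.
features = np.array([
    [230.1, 37.8, 69.1],   # Amazon
    [ 44.5, 39.3, 23.1],   # Google
    [ 17.2, 45.9, 34.7],   # Facebook
    [151.5, 41.3, 13.2],   # Apple
])

# Target vector: units sold for each company.
targets = np.array([
    [22.1],
    [10.4],
    [18.3],
    [18.5],
])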
Growing complexity
As the number of features grows, the complexity of our model increases and it becomes increasingly difficult to visualize, or even comprehend, our data.
One solution is to break the data apart and compare 1-2
features at a time. In this example we explore how Radio and TV investment
impacts Sales.
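A quick way to do this is a pair of scatter plots (a sketch, assuming matplotlib and the features/targets arrays defined above; the full 200-company dataset would make the trends clearer).

import matplotlib.pyplot as plt

# Compare one feature at a time against the target.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(features[:, 0], targets)
ax1.set_xlabel('TV spend')
ax1.set_ylabel('Units sold')
ax2.scatter(features[:, 1], targets)
ax2.set_xlabel('Radio spend')
ax2.set_ylabel('Units sold')
plt.show()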
Normalization
As the number of features grows, computing the gradient takes longer.
We can speed this up by “normalizing” our input data to
ensure all values are within the same range. This is especially important for
datasets with high standard deviations or differences in the ranges of the
attributes. Our goal now will be to normalize our features so they are all in
the range -1 to 1.
Algorithm:
For each feature column {
    #1 Subtract the mean of the column (mean normalization)
    #2 Divide by the range of the column (feature scaling)
}
Our input is a 200 x 3 matrix containing TV, Radio, and Newspaper data. Our output is a normalized matrix of the same shape with all values between -1 and 1.
Code
import numpy as np

def normalize(features):
    '''
    features - (200, 3)
    features.T - (3, 200)

    We transpose the input matrix, swapping
    cols and rows to make vector math easier
    '''
    for feature in features.T:
        fmean = np.mean(feature)
        frange = np.amax(feature) - np.amin(feature)

        # Vector subtraction
        feature -= fmean

        # Vector division
        feature /= frange

    return features
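As a quick sanity check (a sketch reusing the toy features array from earlier), we can normalize a float copy of the data and confirm every column now sits inside [-1, 1]. Note that normalize modifies its argument in place.

# .astype(float) hands normalize() a fresh float copy to modify in place.
features = normalize(features.astype(float))
print(features.min(axis=0))   # every column minimum is >= -1
print(features.max(axis=0))   # every column maximum is <= 1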
Our predict function outputs an estimate of sales given our
current weights (coefficients) and a company’s TV, radio, and newspaper spend.
Our model will try to identify weight values that most reduce our cost
function.
Sales=W1TV+W2Radio+W3Newspaper
def predict(features, weights):
    '''
    features - (200, 3)
    weights - (3, 1)
    predictions - (200, 1)
    '''
    predictions = np.dot(features, weights)
    return predictions
Initialize weights
W1 = 0.0
W2 = 0.0
W3 = 0.0
weights = np.array([
    [W1],
    [W2],
    [W3]
])
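With all-zero weights, a quick call (just a sketch, using the arrays defined earlier) confirms the shapes line up and that the untrained model predicts zero sales for every company.

predictions = predict(features, weights)
print(predictions.shape)   # (N, 1): one estimate per company
print(predictions[:2])     # all zeros before any training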
Cost function
Now we need a cost function to audit how our model is performing. The math is the same, except we swap the mx+b expression for W1x1+W2x2+W3x3. We also divide the expression by 2 to make derivative calculations simpler.
def cost_function(features, targets, weights):
    '''
    features: (200, 3)
    targets: (200, 1)
    weights: (3, 1)

    Returns average squared error among predictions
    '''
    N = len(targets)

    predictions = predict(features, weights)

    # Matrix math lets us do this without looping
    sq_error = (predictions - targets)**2

    # Return average squared error among predictions
    return 1.0/(2*N) * sq_error.sum()
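For example (a sketch, using the zero-initialized weights above), we can record the starting cost so we have a baseline that training should drive down.

# Baseline cost before any gradient descent steps.
initial_cost = cost_function(features, targets, weights)
print(initial_cost)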
Gradient descent
Again using the chain rule, we can compute the gradient: a vector of partial
derivatives describing the slope of the cost function with respect to each
weight.
f′(W1)=−x1(y−(W1x1+W2x2+W3x3))
f′(W2)=−x2(y−(W1x1+W2x2+W3x3))
f′(W3)=−x3(y−(W1x1+W2x2+W3x3))
def update_weights(features, targets, weights, lr):
    '''
    Features: (200, 3)
    Targets: (200, 1)
    Weights: (3, 1)
    '''
    predictions = predict(features, weights)

    # Extract our features
    x1 = features[:,0]
    x2 = features[:,1]
    x3 = features[:,2]

    # Use dot product to calculate the derivative for each weight
    d_w1 = -x1.dot(targets - predictions)
    d_w2 = -x2.dot(targets - predictions)
    d_w3 = -x3.dot(targets - predictions)

    # Multiply the mean derivative by the learning rate
    # and subtract from our weights (remember gradient points in direction of steepest ASCENT)
    weights[0][0] -= (lr * np.mean(d_w1))
    weights[1][0] -= (lr * np.mean(d_w2))
    weights[2][0] -= (lr * np.mean(d_w3))

    return weights
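A single call (a sketch; the learning-rate value here is only illustrative) nudges each weight in the direction opposite its partial derivative.

lr = 0.0005
weights = update_weights(features, targets, weights, lr)
print(weights)                                    # updated in place
print(cost_function(features, targets, weights))  # cost after one step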
And that’s it! Multivariate linear regression.
Simplifying with matrices
The gradient descent code above has a lot of duplication. Can we
improve it somehow? One way to refactor would be to loop through our features
and weights, allowing our function to handle any number of features. However,
there is an even better technique: vectorized gradient descent.
Math
We use the same formula as above, but instead of operating on a single feature
at a time, we use matrix multiplication to operate on all features and weights
simultaneously. We replace the individual xi terms with a single feature
matrix X.
gradient=−X(targets−predictions)
Code
X = [
    [x1, x2, x3],
    [x1, x2, x3],
    .
    .
    .
    [x1, x2, x3]
]

targets = [
    [1],
    [2],
    [3]
]
def update_weights_vectorized(X, targets, weights, lr):
    '''
    gradient = X.T * (predictions - targets) / N

    X: (200, 3)
    Targets: (200, 1)
    Weights: (3, 1)
    '''
    companies = len(X)

    #1 - Get Predictions
    predictions = predict(X, weights)

    #2 - Calculate error/loss
    error = targets - predictions

    #3 - Transpose features from (200, 3) to (3, 200)
    # So we can multiply with the (200, 1) error matrix.
    # Returns a (3, 1) matrix holding 3 partial derivatives --
    # one for each feature -- representing the aggregate
    # slope of the cost function across all observations
    gradient = np.dot(-X.T, error)

    #4 - Take the average error derivative for each feature
    gradient /= companies

    #5 - Multiply the gradient by our learning rate
    gradient *= lr

    #6 - Subtract from our weights to minimize cost
    weights -= gradient

    return weights
Bias term
Our train function is the same as for simple linear regression; however, we’re
going to make one final tweak before running: add a bias term to our feature
matrix.
In our example, it’s very unlikely that sales would be zero if
companies stopped advertising. Possible reasons for this might include past
advertising, existing customer relationships, retail locations, and
salespeople. A bias term will help us capture this base case.
Code
Below we add a column of constant 1s to our features matrix. Because this
feature is always 1, its weight acts as a constant bias term in our predictions.
bias = np.ones(shape=(len(features),1))
features = np.append(bias, features, axis=1)
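The training driver below is a minimal sketch (the train name, zero initialization, and logging interval are our own choices, not fixed by the text above). With the bias column prepended, the feature matrix becomes (200, 4), so the weight vector needs a fourth entry for the bias.

def train(features, targets, weights, lr, iters):
    cost_history = []
    for i in range(iters):
        weights = update_weights_vectorized(features, targets, weights, lr)
        cost = cost_function(features, targets, weights)
        cost_history.append(cost)
        if i % 100 == 0:
            print("iter: {}  cost: {:.3f}".format(i, cost))
    return weights, cost_history

# One extra weight for the bias column added above.
weights = np.zeros(shape=(4, 1))
weights, cost_history = train(features, targets, weights, lr=0.0005, iters=1000)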
Model evaluation
After training our model through 1000 iterations with a learning
rate of .0005, we finally arrive at a set of weights we can use to make predictions:
Sales=4.7TV+3.5Radio+.81Newspaper+13.9
Our MSE cost dropped from 110.86 to 6.25.