
Logistic Regression

 

What is Logistic Regression?

Logistic regression is a supervised machine learning algorithm that, despite its name, is used for classification. It predicts a categorical dependent variable with the help of independent variables, converting its raw output with the logistic sigmoid function into a probability value that can then be mapped to two or more discrete classes.

Comparison to linear regression

Logistic Regression | Linear Regression
Solves classification problems | Solves regression problems
Predicts a categorical dependent variable using a set of independent variables | Predicts a continuous dependent variable using a set of independent variables
Model parameters are estimated with maximum likelihood estimation | Model parameters are estimated with least squares estimation
Output is a categorical value, such as 0 or 1, Yes or No | Output is a continuous value, such as price or age


Types of logistic regression

  1. Binary (Pass/Fail)
  2. Multinomial (Cats, Dogs, Sheep)
  3. Ordinal (Low, Medium, High)

Say we’re given data on student exam results, and our goal is to predict whether a student will pass or fail based on the number of hours slept and the number of hours spent studying. We have two features (hours slept, hours studied) and two classes: passed (1) and failed (0).

Studied | Slept | Passed
4.85    | 9.63  | 1
8.62    | 3.23  | 0
5.43    | 8.23  | 1
9.21    | 6.34  | 0


Graphically we could represent our data with a scatter plot.
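
A minimal sketch of that scatter plot, assuming matplotlib and the four sample rows above (the variable names are my own, not from the original tutorial):

import matplotlib.pyplot as plt
import numpy as np

studied = np.array([4.85, 8.62, 5.43, 9.21])
slept = np.array([9.63, 3.23, 8.23, 6.34])
passed = np.array([1, 0, 1, 0])

# Color each student by outcome: blue circles passed, red squares failed
plt.scatter(studied[passed == 1], slept[passed == 1], c='b', marker='o', label='Passed')
plt.scatter(studied[passed == 0], slept[passed == 0], c='r', marker='s', label='Failed')
plt.xlabel('Hours studied')
plt.ylabel('Hours slept')
plt.legend(loc='upper right')
plt.show()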


Sigmoid activation

In order to map predicted values to probabilities, we use the sigmoid function, which maps any real value to a value between 0 and 1.

s(z) = 1 / (1 + e^(-z))


·       s(z) = output between 0 and 1 (probability estimate)

·       z = input to the function (your algorithm’s prediction e.g. mx + b)

·       e = base of natural log

    

Code

import numpy as np

def sigmoid(z):
  return 1.0 / (1 + np.exp(-z))
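
A quick sanity check of the function (the inputs here are arbitrary illustrative values): an input of 0 maps to exactly 0.5, and large positive or negative inputs saturate toward 1 or 0.

sigmoid(0)     # 0.5
sigmoid(10)    # ~0.99995
sigmoid(-10)   # ~0.00005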

Decision boundary

Our current prediction function returns a probability score between 0 and 1. In order to map this to a discrete class (true/false, cat/dog), we select a threshold value or tipping point above which we will classify values into class 1 and below which we classify values into class 2.

p ≥ 0.5: class = 1
p < 0.5: class = 0

For example, if our threshold was .5 and our prediction function returned .7, we would classify this observation as positive. If our prediction was .2 we would classify the observation as negative. For logistic regression with multiple classes we could select the class with the highest predicted probability.


Making predictions

Using our knowledge of sigmoid functions and decision boundaries, we can now write a prediction function. A prediction function in logistic regression returns the probability of our observation being positive, True, or “Yes”. We call this class 1 and its notation is P(class=1). As the probability gets closer to 1, our model is more confident that the observation is in class 1.

Math

Let’s use the same multiple linear regression equation from our linear regression tutorial.

z = W0 + W1·Studied + W2·Slept

This time however we will transform the output using the sigmoid function to return a probability value between 0 and 1.

If the model returns .4 it believes there is only a 40% chance of passing. If our decision boundary was .5, we would categorize this observation as “Fail.”


Code

We wrap the sigmoid function over the same prediction function we used in multiple linear regression.

def predict(features, weights):
  '''
  Returns 1D array of probabilities
  that the class label == 1
  '''
  z = np.dot(features, weights)
  return sigmoid(z)
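
As a quick illustration (not from the original tutorial): assuming the features matrix includes a leading column of 1s for the bias term W0, calling predict with all-zero starting weights gives 0.5 for every student, since sigmoid(0) = 0.5.

features = np.array([
    [1, 4.85, 9.63],   # bias, hours studied, hours slept
    [1, 8.62, 3.23],
    [1, 5.43, 8.23],
    [1, 9.21, 6.34],
])
weights = np.zeros((3, 1))                    # untrained starting weights
probabilities = predict(features, weights)    # shape (4, 1); all 0.5 before training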

 

Cost function

Unfortunately we can’t (or at least shouldn’t) use the same cost function, MSE (L2), as we did for linear regression. Why? There is a great math explanation in chapter 3 of Michael Nielsen’s deep learning book, but for now I’ll simply say it’s because our prediction function is non-linear (due to the sigmoid transform). Squaring this prediction as we do in MSE results in a non-convex function with many local minima. If our cost function has many local minima, gradient descent may not find the optimal global minimum.

Math

Instead of Mean Squared Error, we use a cost function called Cross-Entropy, also known as Log Loss. Cross-entropy loss can be divided into two separate cost functions: one for y=1 and one for y=0.

If y = 1: cost = −log( s(z) )
If y = 0: cost = −log( 1 − s(z) )
The benefits of taking the logarithm reveal themselves when you look at the cost function graphs for y=1 and y=0. These smooth monotonic functions (always increasing or always decreasing) make it easy to calculate the gradient and minimize cost. (The graphs come from Andrew Ng’s slides on logistic regression.)


The key thing to note is the cost function penalizes confident and wrong predictions more than it rewards confident and right predictions! The corollary is that increasing prediction accuracy (closer to 0 or 1) has diminishing returns on reducing cost due to the logarithmic nature of our cost function.
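
For example, with y = 1, a fairly confident and correct prediction of 0.9 costs −log(0.9) ≈ 0.105, while an equally confident but wrong prediction of 0.1 costs −log(0.1) ≈ 2.303, more than twenty times as much. Pushing the correct prediction from 0.9 up to 0.99 only shaves the cost from 0.105 to roughly 0.010.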

Above functions compressed into one:

Cost = −( y·log( s(z) ) + (1 − y)·log( 1 − s(z) ) ), averaged over all observations
Multiplying by y and (1 − y) in the above equation is a sneaky trick that lets us use the same equation to solve for both the y=1 and y=0 cases. If y=0, the first term cancels out. If y=1, the second term cancels out. In both cases we only perform the operation we need to perform.


Vectorized cost function

predictions = s( features · weights )
Cost = −( labelsᵀ·log(predictions) + (1 − labels)ᵀ·log(1 − predictions) ) / N
Code

def cost_function(features, labels, weights):
    '''
    Cross-Entropy (Log Loss) cost

    Features: (100, 3)
    Labels: (100, 1)
    Weights: (3, 1)
    Returns: average cost over all observations (a scalar)
    Cost = -( labels*log(predictions) + (1-labels)*log(1-predictions) ) / len(labels)
    '''
    observations = len(labels)

    predictions = predict(features, weights)

    # Error when label = 1
    class1_cost = -labels * np.log(predictions)

    # Error when label = 0
    class2_cost = (1 - labels) * np.log(1 - predictions)

    # Combine both costs (subtracting applies the missing negative sign to class2_cost)
    cost = class1_cost - class2_cost

    # Average the cost over all observations
    cost = cost.sum() / observations

    return cost

Gradient descent

To minimize our cost, we use Gradient Descent just like before in Linear Regression. There are other, more sophisticated optimization algorithms out there, such as conjugate gradient and quasi-Newton methods like BFGS, but you don’t have to worry about these. Machine learning libraries like Scikit-learn hide their implementations so you can focus on more interesting things!
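
To illustrate that point, a library version of this whole post fits in a few lines. This is only a sketch using scikit-learn on the sample data above, not the gradient descent implementation we build below:

from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.array([[4.85, 9.63], [8.62, 3.23], [5.43, 8.23], [9.21, 6.34]])  # studied, slept
y = np.array([1, 0, 1, 0])

clf = LogisticRegression()            # scikit-learn adds the intercept (bias) term for us
clf.fit(X, y)
print(clf.predict_proba(X)[:, 1])     # probability of "Passed" for each student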

Math

One of the neat properties of the sigmoid function is that its derivative is easy to calculate. If you’re curious, there is a good walk-through derivation on Stack Overflow. Michael Nielsen also covers the topic in chapter 3 of his book.

s′(z) = s(z)·(1 − s(z))

Which leads to an equally beautiful and convenient cost function derivative:

C′ = x·(s(z) − y)

·       C′ is the derivative of cost with respect to the weights

·       y is the actual class label (0 or 1)

·       s(z) is your model’s prediction

·       x is your feature or feature vector.

Notice how this gradient is the same as the MSE (L2) gradient; the only difference is the hypothesis function.

Pseudocode

Repeat {
 
  1. Calculate gradient average
  2. Multiply by learning rate
  3. Subtract from weights
 
}

Code

def update_weights(features, labels, weights, lr):
    '''
    Vectorized Gradient Descent
 
    Features:(200, 3)
    Labels: (200, 1)
    Weights:(3, 1)
    '''
    N = len(features)
 
    #1 - Get Predictions
    predictions = predict(features, weights)
 
    #2 - Transpose features from (200, 3) to (3, 200)
    # so we can matrix multiply with the (200, 1) error matrix
    # (predictions - labels). Returns a (3, 1) matrix holding
    # 3 partial derivatives -- one for each feature -- representing
    # the aggregate slope of the cost function across all observations
    gradient = np.dot(features.T,  predictions - labels)
 
    #3 Take the average cost derivative for each feature
    gradient /= N
 
    #4 - Multiply the gradient by our learning rate
    gradient *= lr
 
    #5 - Subtract from our weights to minimize cost
    weights -= gradient
 
    return weights
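
A single call performs one step of gradient descent; the learning rate value here is just an illustrative placeholder:

weights = update_weights(features, labels, weights, lr=0.01)

Repeating this call inside a loop is exactly what the train function below does.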

Mapping probabilities to classes

The final step is to assign class labels (0 or 1) to our predicted probabilities.

Decision boundary

def decision_boundary(prob):
  return 1 if prob >= .5 else 0

Convert probabilities to classes

def classify(predictions):
  '''
  input  - N element array of predictions between 0 and 1
  output - N element array of 0s (False) and 1s (True)
  '''
  # np.vectorize applies the scalar decision_boundary function element-wise;
  # use a new name so we don't shadow the decision_boundary function itself
  classifier = np.vectorize(decision_boundary)
  return classifier(predictions).flatten()

Example output

Probabilities = [ 0.967, 0.448, 0.015, 0.780, 0.978, 0.004]
Classifications = [1, 0, 0, 1, 1, 0]

Training

Our training code is the same as we used for linear regression.

def train(features, labels, weights, lr, iters):
    cost_history = []
 
    for i in range(iters):
        weights = update_weights(features, labels, weights, lr)
 
        #Calculate error for auditing purposes
        cost = cost_function(features, labels, weights)
        cost_history.append(cost)
 
        # Log Progress
        if i % 1000 == 0:
            print "iter: "+str(i) + " cost: "+str(cost)
 
    return weights, cost_history
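
To kick off training, we pass in starting weights along with a learning rate and an iteration count. A minimal sketch, assuming the features array from earlier; the lr and iters values are illustrative, not the ones used to produce the results below:

labels = np.array([[1.0], [0.0], [1.0], [0.0]])   # "Passed" column as an (N, 1) array
weights = np.zeros((3, 1))
weights, cost_history = train(features, labels, weights, lr=0.01, iters=3000)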

Model evaluation

If our model is working, we should see our cost decrease after every iteration.

iter: 0 cost: 0.635
iter: 1000 cost: 0.302
iter: 2000 cost: 0.264

Final cost: 0.2487. Final weights: [-8.197, .921, .738]
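
Assuming the weight vector is ordered [bias (W0), W1 for hours studied, W2 for hours slept], we can sanity-check the first sample row: z = −8.197 + 0.921·4.85 + 0.738·9.63 ≈ 3.38, and s(3.38) ≈ 0.97. That is well above the 0.5 decision boundary, so the model predicts that a student who studied 4.85 hours and slept 9.63 hours passes.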

Cost history


Accuracy

Accuracy measures how correct our predictions were. In this case we simply compare predicted labels to true labels and divide by the total.

def accuracy(predicted_labels, actual_labels):
    diff = predicted_labels - actual_labels
    return 1.0 - (float(np.count_nonzero(diff)) / len(diff))
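
Putting the pieces together (a sketch assuming the features, labels, and trained weights from the sections above; labels is flattened so the two arrays line up element-wise):

predicted_labels = classify(predict(features, weights))
print(accuracy(predicted_labels, labels.flatten()))   # fraction of students classified correctly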

Decision boundary

Another helpful technique is to plot the decision boundary on top of our predictions to see how our labels compare to the actual labels. This involves plotting our predicted probabilities and coloring them with their true labels.

Code to plot the decision boundary

def plot_decision_boundary(trues, falses):
    fig = plt.figure()
    ax = fig.add_subplot(111)
 
    no_of_preds = len(trues) + len(falses)
 
    ax.scatter([i for i in range(len(trues))], trues, s=25, c='b', marker="o", label='Trues')
    ax.scatter([i for i in range(len(falses))], falses, s=25, c='r', marker="s", label='Falses')
 
    plt.legend(loc='upper right');
    ax.set_title("Decision Boundary")
    ax.set_xlabel('N/2')
    ax.set_ylabel('Predicted Probability')
    plt.axhline(.5, color='black')

    plt.show()



