Classification is a type of supervised learning where the goal is to predict the category, or class label, of a given input based on its features. The number of class labels is finite, i.e., discrete.
Binary Classification
There are two class labels, e.g., Spam or Not Spam, Yes or No.
Multiclass Classification
More than two class labels, e.g., Lion, Tiger, or Leopard.
Multilabel Classification
Each instance can have multiple labels, e.g., an image can be tagged as both sunset and beach.
In binary classification, all instances are classified into two categories or classes, such as Yes/No or 1/0. Classification involves learning a decision boundary that separates the two classes. This boundary can be linear or non-linear (for example, polynomial), depending on the complexity of the problem.
In linear regression, the model combines the input features linearly and adds a bias term, producing an unbounded output. In binary classification, however, we need a function whose output lies in the range [0, 1] so it can be interpreted as a probability. This is where logistic regression comes in.
The training function for binary classification is a probability function that maps inputs to values between 0 and 1:

P(y = 1 \mid x) = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}

This function outputs the probability of an observation belonging to class 1, and the resulting model is called logistic regression.
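As an illustration, here is a minimal sketch of this prediction function in Python with NumPy; the feature values, weights, and bias below are made-up numbers, not taken from the text:

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into the (0, 1) range.
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # Linear combination of the features plus bias, passed through the sigmoid,
    # gives the probability of class 1 for each row of X.
    return sigmoid(X @ w + b)

# Toy example with a single feature (e.g., age); weights are illustrative only.
X = np.array([[30.0], [55.0]])
w = np.array([0.08])
b = -2.0
print(predict_proba(X, w, b))  # approximately [0.60 0.92]
```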
Bernoulli Trials
Each observation in binary classification can be treated as a Bernoulli trial, with two possible outcomes: 1 (success) or 0 (failure).
Probability of Observing a Single Data Point
Given the predicted probability (P) from logistic regression, the likelihood of observing the actual outcome (y) is modeled using the Bernoulli distribution:

\Pr(y \mid P) = P^{y} (1 - P)^{1 - y}
For example, suppose we have a data point where the customer's age is 30 and our logistic regression model outputs a predicted probability P = 0.8. If the actual outcome is y = 1, the likelihood of this observation is 0.8^{1} \cdot 0.2^{0} = 0.8; if the actual outcome is y = 0, the likelihood is only 0.2. The likelihood is therefore high when the model's prediction matches the actual outcome and low when it does not.
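A quick sketch of this single-observation likelihood in Python, reusing the P = 0.8 value from the example above:

```python
def bernoulli_likelihood(y, p):
    # Likelihood of observing label y (0 or 1) given predicted probability p.
    return p**y * (1 - p)**(1 - y)

p = 0.8
print(bernoulli_likelihood(1, p))  # 0.8  -> high likelihood if the true label is 1
print(bernoulli_likelihood(0, p))  # ~0.2 -> low likelihood if the true label is 0
```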
Since the observations in the dataset are independent, the likelihood of observing the entire dataset is the product of the individual likelihoods:

L(w) = \prod_{i=1}^{N} P_i^{y_i} (1 - P_i)^{1 - y_i}
Our goal is to find the parameters (w) that maximize the likelihood of the entire dataset given the features (X).
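As a small illustration, the sketch below (with made-up labels and predicted probabilities) computes the per-observation likelihoods and their product over a toy dataset:

```python
import numpy as np

y = np.array([1, 0, 1, 1])          # actual labels (made up)
p = np.array([0.9, 0.2, 0.7, 0.6])  # predicted probabilities (made up)

# Per-observation Bernoulli likelihoods, then their product over the dataset.
per_obs = p**y * (1 - p)**(1 - y)
dataset_likelihood = np.prod(per_obs)
print(per_obs)             # [0.9 0.8 0.7 0.6]
print(dataset_likelihood)  # 0.3024
```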
Maximizing a product of probabilities directly is computationally awkward, so we take the logarithm, which turns the product into a sum and makes the optimization easier. The log-likelihood is given by:

\log L(w) = \sum_{i=1}^{N} \left[ y_i \log P_i + (1 - y_i) \log(1 - P_i) \right]
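Continuing with the same made-up numbers, the sum of log-likelihood terms matches the log of the product computed above:

```python
import numpy as np

y = np.array([1, 0, 1, 1])          # same toy labels as above
p = np.array([0.9, 0.2, 0.7, 0.6])  # same toy predicted probabilities

log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_likelihood)                            # about -1.196
print(np.log(np.prod(p**y * (1 - p)**(1 - y))))  # log of the product: same value
```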
In machine learning, we usually minimize a loss function rather than maximize an objective, so we negate the log-likelihood to obtain a minimization problem. The negative log-likelihood is given by:

J(w) = -\sum_{i=1}^{N} \left[ y_i \log P_i + (1 - y_i) \log(1 - P_i) \right]
This function measures how well the predicted probabilities align with the actual data. If the model predicts the correct class with high confidence, the loss is low. Conversely, it penalizes being confident and wrong.
The cost contributed by a single observation is:

-\left[ y \log P + (1 - y) \log(1 - P) \right]
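As a quick numerical illustration of this per-observation cost (probabilities chosen arbitrarily):

```python
import numpy as np

def single_loss(y, p):
    # Negative log-likelihood of one observation.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(single_loss(1, 0.9))  # ~0.105: confident and correct -> low cost
print(single_loss(1, 0.1))  # ~2.303: confident and wrong   -> high cost
```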
Directly using the summed cost (J(w)) is problematic: its magnitude grows with the number of observations, so losses are not comparable across datasets of different sizes and the choice of learning rate becomes sensitive to the dataset size.
To address this, we average the cost over the (N) observations:

J(w) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log P_i + (1 - y_i) \log(1 - P_i) \right]
This average loss per observation, known as the binary cross-entropy, is independent of the dataset size and gives a direct sense of model performance.
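A short sketch of the average loss over the same made-up values used earlier:

```python
import numpy as np

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])

# Mean negative log-likelihood over the N observations.
avg_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(avg_loss)  # about 0.299
```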
Gradient descent for logistic regression follows the same procedure as for linear regression, except that the sigmoid function now appears in the predictions. We use the chain rule to compute the gradients of the average loss with respect to the parameters.
For the weights (w):

\frac{\partial J}{\partial w} = \frac{1}{N} \sum_{i=1}^{N} (P_i - y_i)\, x_i

For the bias (b):

\frac{\partial J}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} (P_i - y_i)
In each iteration of gradient descent, the weights and bias are updated in the opposite direction of the gradients, scaled by a learning rate \alpha:

w := w - \alpha \frac{\partial J}{\partial w}, \qquad b := b - \alpha \frac{\partial J}{\partial b}
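Putting the pieces together, here is a minimal sketch of a gradient-descent training loop for logistic regression in Python with NumPy; the learning rate, iteration count, and toy data are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        # Forward pass: predicted probabilities for each observation.
        p = sigmoid(X @ w + b)
        # Gradients of the average negative log-likelihood.
        dw = (X.T @ (p - y)) / n_samples
        db = np.mean(p - y)
        # Update step: move opposite to the gradients.
        w -= lr * dw
        b -= lr * db
    return w, b

# Toy, linearly separable data: one feature, class switches near 3.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = train_logistic_regression(X, y)
print(sigmoid(X @ w + b))  # probabilities should increase with the feature value
```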
In binary classification, the goal is to classify instances into two distinct categories. Logistic regression is a widely used approach that outputs probabilities between 0 and 1, and the model is trained by optimizing the log-likelihood. Proper optimization and interpretation of the loss function are crucial for building a successful binary classifier.