Logistic Regression: A Comprehensive Guide

Table of Contents

  1. Introduction
  2. Mathematical Foundation
  3. Types of Logistic Regression
  4. Algorithm Implementation
  5. Assumptions
  6. Evaluation Metrics
  7. Advantages and Disadvantages
  8. Common Interview Questions
  9. Practical Considerations
  10. Code Examples

Introduction

Logistic Regression is a fundamental statistical method used for binary and multiclass classification problems. Despite its name containing “regression,” it is actually a classification algorithm that models the probability of class membership using the logistic (sigmoid) function. It serves as a cornerstone in machine learning and provides an excellent foundation for understanding more complex classification algorithms.

Key Concepts:

• The sigmoid function maps a linear combination of features to a probability in (0, 1)
• The log-odds (logit) of the positive class is modeled as a linear function of the features
• Parameters are estimated by Maximum Likelihood Estimation rather than least squares
• A decision threshold (typically 0.5) converts predicted probabilities into class labels

Mathematical Foundation

Binary Logistic Regression

The logistic regression model uses the sigmoid (logistic) function to map any real-valued input to a value between 0 and 1:

\[P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n)}}\]

Or more compactly:

\[P(Y=1|X) = \frac{1}{1 + e^{-\mathbf{X}\boldsymbol{\beta}}}\]

Where:

• $P(Y=1|X)$ is the probability of the positive class given the features
• $\beta_0$ is the intercept (bias) term
• $\beta_1, ..., \beta_n$ are the coefficients for features $x_1, ..., x_n$
• $\mathbf{X}\boldsymbol{\beta}$ is the linear combination of features and coefficients (with a leading column of ones for the intercept)

Sigmoid Function Properties

The sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ has several important properties:

• Its output is bounded in (0, 1), so it can be interpreted as a probability
• $\sigma(0) = 0.5$, and the function is monotonically increasing and S-shaped
• It is symmetric in the sense that $\sigma(-z) = 1 - \sigma(z)$
• Its derivative has a simple closed form, which makes gradient computations cheap:

\[\sigma'(z) = \sigma(z)(1 - \sigma(z))\]
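
A minimal NumPy sketch of the sigmoid and its derivative; the clipping of z is a numerical-stability detail added here, not part of the formulas above:

import numpy as np

def sigmoid(z):
    """Map any real-valued input to (0, 1)."""
    z = np.clip(z, -500, 500)  # avoid overflow in exp for extreme inputs
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))             # approx [0.119, 0.5, 0.881]
print(sigmoid_derivative(z))  # approx [0.105, 0.25, 0.105]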

Odds and Log-Odds (Logit)

Odds: The ratio of probability of success to probability of failure:

\[\text{Odds} = \frac{P(Y=1|X)}{P(Y=0|X)} = \frac{P(Y=1|X)}{1 - P(Y=1|X)}\]

Log-Odds (Logit): The natural logarithm of odds:

\[\text{logit}(P) = \ln\left(\frac{P}{1-P}\right) = \mathbf{X}\boldsymbol{\beta}\]

This is the linear component of logistic regression, showing that logistic regression models the log-odds as a linear function of the features. For example, a predicted probability of 0.8 corresponds to odds of 0.8/0.2 = 4 and log-odds of $\ln(4) \approx 1.39$; it is this log-odds, not the probability itself, that the model treats as linear in the features.

Maximum Likelihood Estimation

Unlike linear regression, logistic regression has no closed-form solution for its parameters. We use Maximum Likelihood Estimation (MLE), solved iteratively (e.g., by gradient descent or Newton-Raphson), to find the optimal parameters.

Likelihood Function: For binary classification:

\[L(\boldsymbol{\beta}) = \prod_{i=1}^{m} P(Y=y^{(i)}|X^{(i)})\]

\[L(\boldsymbol{\beta}) = \prod_{i=1}^{m} [h_{\boldsymbol{\beta}}(x^{(i)})]^{y^{(i)}} [1 - h_{\boldsymbol{\beta}}(x^{(i)})]^{1-y^{(i)}}\]

Log-Likelihood: Taking the natural log for easier optimization:

\[\ell(\boldsymbol{\beta}) = \sum_{i=1}^{m} [y^{(i)} \log(h_{\boldsymbol{\beta}}(x^{(i)})) + (1-y^{(i)}) \log(1 - h_{\boldsymbol{\beta}}(x^{(i)}))]\]

Cost Function (Cross-Entropy Loss)

The negative log-likelihood gives us the cross-entropy loss:

\[J(\boldsymbol{\beta}) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(h_{\boldsymbol{\beta}}(x^{(i)})) + (1-y^{(i)}) \log(1 - h_{\boldsymbol{\beta}}(x^{(i)}))]\]
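
A short NumPy sketch of this cost; the small eps clip is an implementation detail added here to keep the logarithms finite:

import numpy as np

def cross_entropy_cost(y, y_hat, eps=1e-12):
    """Average negative log-likelihood for binary labels y and predicted probabilities y_hat."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(cross_entropy_cost(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))  # ~0.228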

Gradient: The partial derivatives for gradient descent:

\[\frac{\partial J}{\partial \beta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_{\boldsymbol{\beta}}(x^{(i)}) - y^{(i)}) x_j^{(i)}\]

In matrix form:

\[\nabla J = \frac{1}{m} \mathbf{X}^T (h_{\boldsymbol{\beta}}(\mathbf{X}) - \mathbf{y})\]

Multiclass Logistic Regression

For multiple classes, we extend to multinomial logistic regression using the softmax function:

\[P(Y=k|X) = \frac{e^{\mathbf{X}\boldsymbol{\beta}_k}}{\sum_{j=1}^{K} e^{\mathbf{X}\boldsymbol{\beta}_j}}\]

Where $K$ is the number of classes and $\boldsymbol{\beta}_k$ is the coefficient vector for class $k$.
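
A minimal NumPy sketch of the softmax; subtracting the row-wise maximum is a standard stability trick that does not change the result:

import numpy as np

def softmax(scores):
    """Row-wise softmax over class scores X @ B (shape: m x K)."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1]])
print(softmax(scores))  # ~ [[0.659, 0.242, 0.099]]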

Types of Logistic Regression

1. Binary Logistic Regression

The target has exactly two classes (e.g., 0/1, spam/not spam); this is the standard form described above.

2. Multinomial Logistic Regression

The target has three or more unordered classes; the softmax function replaces the sigmoid.

3. Ordinal Logistic Regression

The target has three or more ordered categories (e.g., low/medium/high); cumulative logits preserve the ordering.

4. Regularized Logistic Regression

A penalty term is added to the cross-entropy cost to control overfitting and handle multicollinearity; the common penalties are shown below.

L1 Regularization (Lasso)

\(J(\boldsymbol{\beta}) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(h_{\boldsymbol{\beta}}(x^{(i)})) + (1-y^{(i)}) \log(1 - h_{\boldsymbol{\beta}}(x^{(i)}))] + \lambda \sum_{j=1}^{n} |\beta_j|\)

L2 Regularization (Ridge)

\(J(\boldsymbol{\beta}) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(h_{\boldsymbol{\beta}}(x^{(i)})) + (1-y^{(i)}) \log(1 - h_{\boldsymbol{\beta}}(x^{(i)}))] + \lambda \sum_{j=1}^{n} \beta_j^2\)

Elastic Net

\(J(\boldsymbol{\beta}) = \text{Cross-entropy} + \lambda_1 \sum_{j=1}^{n} |\beta_j| + \lambda_2 \sum_{j=1}^{n} \beta_j^2\)
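
For reference, a scikit-learn sketch of the three penalties (scikit-learn parameterizes regularization strength as C, the inverse of λ, and elastic net requires the 'saga' solver):

from sklearn.linear_model import LogisticRegression

# L1 (Lasso-style) penalty: sparse coefficients, implicit feature selection
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)

# L2 (Ridge-style) penalty: shrinks coefficients, the scikit-learn default
l2_model = LogisticRegression(penalty='l2', solver='lbfgs', C=1.0)

# Elastic net: mixes L1 and L2; l1_ratio controls the mix
enet_model = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, C=1.0, max_iter=5000)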

Algorithm Implementation

Gradient Descent Algorithm

ALGORITHM: Logistic Regression with Gradient Descent

INPUT: 
  - Training data: X (m × n), y (m × 1) with y ∈ {0,1}
  - Learning rate: α
  - Number of iterations: max_iter
  - Convergence threshold: ε

OUTPUT: Optimal parameters β

PROCEDURE:
1. Initialize β randomly (small values near zero)
2. Add bias column to X (column of ones)
3. FOR iteration = 1 to max_iter:
   a. Compute linear combination: z = X × β
   b. Compute predictions: ŷ = sigmoid(z) = 1/(1 + exp(-z))
   c. Compute cost: J = -(1/m) × [y^T × log(ŷ) + (1-y)^T × log(1-ŷ)]
   d. Compute gradients: ∇J = (1/m) × X^T × (ŷ - y)
   e. Update parameters: β = β - α × ∇J
   f. IF ||∇J|| < ε: BREAK (convergence)
4. RETURN β

COMPLEXITY:
- Time: O(max_iter × m × n)
- Space: O(m × n)
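
A minimal NumPy translation of the pseudocode above; the function name and the synthetic data at the end are illustrative only:

import numpy as np

def fit_logistic_gd(X, y, alpha=0.1, max_iter=10000, eps=1e-6):
    """Gradient-descent logistic regression; returns coefficients beta (bias first)."""
    X = np.column_stack([np.ones(len(X)), X])    # step 2: add bias column
    beta = np.zeros(X.shape[1])                  # step 1: initialize (zeros also work)
    m = len(y)
    for _ in range(max_iter):
        y_hat = 1.0 / (1.0 + np.exp(-X @ beta))  # steps 3a-3b: z = X beta, then sigmoid
        grad = X.T @ (y_hat - y) / m             # step 3d: gradient of the cross-entropy cost
        beta -= alpha * grad                     # step 3e: parameter update
        if np.linalg.norm(grad) < eps:           # step 3f: convergence check
            break
    return beta

# Illustrative usage on synthetic, noisy data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)
print(fit_logistic_gd(X_demo, y_demo))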

Newton-Raphson Method

ALGORITHM: Logistic Regression with Newton-Raphson

INPUT: 
  - Training data: X (m × n), y (m × 1)
  - Number of iterations: max_iter
  - Convergence threshold: ε

OUTPUT: Optimal parameters β

PROCEDURE:
1. Initialize β to small random values
2. Add bias column to X
3. FOR iteration = 1 to max_iter:
   a. Compute predictions: p = sigmoid(X × β)
   b. Compute gradient: g = X^T × (p - y)
   c. Compute Hessian: H = X^T × diag(p × (1-p)) × X
   d. Update: β = β - H^(-1) × g
   e. IF ||g|| < ε: BREAK
4. RETURN β

COMPLEXITY:
- Time: O(max_iter × (m × n² + n³)) for building and inverting the Hessian
- Space: O(n²) for Hessian matrix
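
A corresponding NumPy sketch; np.linalg.solve replaces the explicit inverse, and the tiny ridge term added to the Hessian is a safeguard, not part of the algorithm above:

import numpy as np

def fit_logistic_newton(X, y, max_iter=25, eps=1e-8):
    """Newton-Raphson (IRLS) logistic regression; returns beta with the bias first."""
    X = np.column_stack([np.ones(len(X)), X])    # add bias column
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # predicted probabilities
        g = X.T @ (p - y)                        # gradient
        W = p * (1.0 - p)                        # diagonal of the weight matrix
        H = X.T @ (X * W[:, None])               # Hessian: X^T diag(W) X
        H += 1e-8 * np.eye(H.shape[0])           # safeguard against a singular Hessian
        beta -= np.linalg.solve(H, g)            # Newton step: beta - H^{-1} g
        if np.linalg.norm(g) < eps:
            break
    return beta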

Prediction Algorithm

ALGORITHM: Logistic Regression Prediction

INPUT: 
  - Trained parameters: β
  - New data: X_new (k × n)
  - Decision threshold: τ (default = 0.5)

OUTPUT: Predictions and probabilities

PROCEDURE:
1. Add bias column to X_new
2. Compute linear combination: z = X_new × β
3. Compute probabilities: p = sigmoid(z)
4. Make predictions: ŷ = (p >= τ) ? 1 : 0
5. RETURN ŷ, p

COMPLEXITY:
- Time: O(k × n)
- Space: O(k)
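
And the prediction step as a small sketch matching the pseudocode, with the default threshold τ = 0.5:

import numpy as np

def predict_logistic(beta, X_new, tau=0.5):
    """Return hard predictions and probabilities for new data given fitted beta (bias first)."""
    X_new = np.column_stack([np.ones(len(X_new)), X_new])  # add bias column
    p = 1.0 / (1.0 + np.exp(-X_new @ beta))                 # probabilities
    return (p >= tau).astype(int), p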

Assumptions

Logistic regression has fewer and more relaxed assumptions compared to linear regression, making it more robust in many scenarios.

1. Binary or Ordinal Target Variable

Definition: The dependent variable should be binary (0/1) for binary logistic regression, or ordinal/categorical for multinomial variants.

Requirements:

Consequences of Violation:

Solutions:

2. Linear Relationship Between Features and Log-Odds

Definition: The log-odds (logit) of the target variable should be a linear combination of the independent variables.

Mathematical Expression: \(\ln\left(\frac{P(Y=1|X)}{P(Y=0|X)}\right) = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n\)

Detailed Explanation:

Detection Methods:

Solutions when violated:

3. Independence of Observations

Definition: Each observation should be independent of all others. No temporal or spatial correlation between residuals.

Same as linear regression: \(Cov(\epsilon_i, \epsilon_j) = 0 \text{ for all } i \neq j\)

Common Violations:

Detection Methods:

Solutions:

4. No Severe Multicollinearity

Definition: Independent variables should not be highly correlated with each other, though logistic regression is more robust to multicollinearity than linear regression.

Impact on Logistic Regression:

Detection Methods:

Solutions:

5. Large Sample Size

Definition: Logistic regression requires larger sample sizes than linear regression, especially for stable coefficient estimates and valid statistical inference.

Guidelines:

Consequences of Small Samples:

Solutions for Small Samples:

6. No Complete Separation

Definition: There should be no complete or quasi-complete separation in the data, where one or more features perfectly predict the outcome.

Types of Separation:

Consequences:

Detection Methods:

Solutions:

Assumption Validation

Implementation Available: src/ml-algos/02_logistic_regression/validate_assumptions.py

This comprehensive module provides:

Example usage:

from src.ml_algos.logistic_regression.validate_assumptions import LogisticRegressionAssumptions

# Create validator instance
validator = LogisticRegressionAssumptions()

# Check all assumptions
results = validator.check_all_assumptions(X, y)
validator.print_assumption_summary()

# Generate diagnostic plots
fig = validator.plot_assumption_diagnostics()

Evaluation Metrics

Classification Metrics

1. Confusion Matrix

For binary classification:

\[\begin{bmatrix} TN & FP \\ FN & TP \end{bmatrix}\]

Where:

• TN (True Negatives): negative cases correctly predicted as negative
• FP (False Positives): negative cases incorrectly predicted as positive
• FN (False Negatives): positive cases incorrectly predicted as negative
• TP (True Positives): positive cases correctly predicted as positive

2. Accuracy

\(Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\)

3. Precision (Positive Predictive Value)

\(Precision = \frac{TP}{TP + FP}\)

4. Recall (Sensitivity, True Positive Rate)

\(Recall = \frac{TP}{TP + FN}\)

5. Specificity (True Negative Rate)

\(Specificity = \frac{TN}{TN + FP}\)

6. F1-Score

\(F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} = \frac{2TP}{2TP + FP + FN}\)

7. F-Beta Score

\(F_\beta = (1 + \beta^2) \times \frac{Precision \times Recall}{\beta^2 \times Precision + Recall}\)
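
A scikit-learn sketch computing the label-based metrics above; the y_true and y_pred vectors are made-up examples:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, fbeta_score, confusion_matrix)

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # sklearn layout: [[TN, FP], [FN, TP]]
print(tn, fp, fn, tp)
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("f2       :", fbeta_score(y_true, y_pred, beta=2))   # recall-weighted F-score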

Probabilistic Metrics

8. Log-Loss (Cross-Entropy)

\(LogLoss = -\frac{1}{m} \sum_{i=1}^{m} [y_i \log(p_i) + (1-y_i) \log(1-p_i)]\)

9. Brier Score

\(BS = \frac{1}{m} \sum_{i=1}^{m} (p_i - y_i)^2\)

10. AUC-ROC (Area Under ROC Curve)

Area under the curve of True Positive Rate vs. False Positive Rate across all thresholds; 0.5 corresponds to random ranking and 1.0 to perfect ranking. It is threshold-independent.

11. AUC-PR (Area Under Precision-Recall Curve)

Area under the Precision-Recall curve across all thresholds; usually more informative than AUC-ROC when the positive class is rare.
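
A scikit-learn sketch for the probabilistic metrics, which score predicted probabilities rather than hard labels (the inputs are made-up examples):

from sklearn.metrics import log_loss, brier_score_loss, roc_auc_score, average_precision_score

y_true = [0, 1, 1, 0, 1, 0]
y_prob = [0.2, 0.9, 0.6, 0.3, 0.8, 0.4]   # predicted P(Y=1|X)

print("log-loss   :", log_loss(y_true, y_prob))
print("Brier score:", brier_score_loss(y_true, y_prob))
print("AUC-ROC    :", roc_auc_score(y_true, y_prob))
print("AUC-PR     :", average_precision_score(y_true, y_prob))  # approximates area under the PR curve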

Statistical Measures

12. Likelihood Ratio Test

\(LR = -2 \ln\left(\frac{L_0}{L_1}\right)\)

Where $L_0$ is likelihood of null model and $L_1$ is likelihood of full model.

13. Pseudo R-squared Measures

McFadden’s R²: \(R^2_{McFadden} = 1 - \frac{\ln(L_1)}{\ln(L_0)}\)

Cox & Snell R²: \(R^2_{CS} = 1 - \left(\frac{L_0}{L_1}\right)^{2/m}\)

Nagelkerke R²: \(R^2_{N} = \frac{R^2_{CS}}{1 - (L_0)^{2/m}}\)
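
A sketch of these measures assuming a fitted statsmodels Logit result, which exposes the fitted (llf) and null-model (llnull) log-likelihoods:

import numpy as np
import statsmodels.api as sm

# res is a fitted statsmodels Logit result: res = sm.Logit(y, sm.add_constant(X)).fit()
def pseudo_r2(res, m):
    """m = number of observations (res.nobs)."""
    ll1, ll0 = res.llf, res.llnull                 # fitted and null log-likelihoods
    mcfadden = 1 - ll1 / ll0
    cox_snell = 1 - np.exp((ll0 - ll1) * 2 / m)    # 1 - (L0/L1)^(2/m), computed in log space
    nagelkerke = cox_snell / (1 - np.exp(ll0 * 2 / m))
    return mcfadden, cox_snell, nagelkerke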

14. Hosmer-Lemeshow Test

Tests goodness of fit by comparing observed vs. expected frequencies across deciles of predicted probabilities.

Advantages and Disadvantages

Advantages

1. Probabilistic Output

Produces class probabilities rather than only hard labels, which supports ranking, risk scoring, and threshold tuning.

2. No Distributional Assumptions

Does not require normally distributed features or homoscedastic errors, unlike linear regression.

3. Interpretable Coefficients

Each coefficient is a change in log-odds, and $e^{\beta_j}$ is an odds ratio, making effects easy to explain.

4. Computational Efficiency

Training and prediction are fast and scale well to large datasets and online settings.

5. Baseline Model

Serves as a simple, strong benchmark before moving to more complex classifiers.

6. Extension Flexibility

Extends naturally to multinomial, ordinal, and regularized variants.

Disadvantages

1. Linear Decision Boundary

Can only separate classes with a boundary that is linear in the (possibly transformed) features.

2. Sensitive to Outliers in Target

Mislabeled or extreme observations can noticeably distort the coefficient estimates.

3. Feature Engineering Required

Non-linear effects and interactions must be added manually (polynomial terms, splines, interaction terms).

4. Sample Size Requirements

Needs relatively large samples for stable maximum likelihood estimates and valid inference.

5. Multicollinearity Issues

Highly correlated features inflate coefficient variance and make individual effects hard to interpret.

6. Assumption Dependencies

Performance degrades when the linearity of the log-odds or the independence of observations is violated.

Common Interview Questions

Theoretical Questions

  1. Q: Explain the difference between linear and logistic regression.

    A:

    • Linear Regression: Predicts continuous values, uses identity link function, minimizes MSE
    • Logistic Regression: Predicts probabilities for classification, uses logit link function, maximizes likelihood
  2. Q: Why can’t we use linear regression for classification?

    A:

    • Linear regression can output values outside [0,1]
    • No inherent probability interpretation
    • Sensitive to outliers affecting decision boundary
    • Assumes constant variance (homoscedasticity)
  3. Q: What is the sigmoid function and why is it used?

    A:

    • $\sigma(z) = \frac{1}{1+e^{-z}}$ maps real numbers to (0,1)
    • Smooth and differentiable everywhere
    • Monotonic and S-shaped
    • Natural interpretation as probability
  4. Q: How do you interpret logistic regression coefficients?

    A:

    • Coefficient $\beta_j$ represents change in log-odds per unit change in $x_j$
    • $e^{\beta_j}$ gives the odds ratio
    • Positive $\beta_j$: increases probability of positive class
    • Negative $\beta_j$: decreases probability of positive class
  5. Q: What is the difference between odds and probability?

    A:

    • Probability: $P = \frac{\text{successes}}{\text{total}}$, range [0,1]
    • Odds: $\frac{P}{1-P} = \frac{\text{successes}}{\text{failures}}$, range [0, ∞)
    • Odds Ratio: Compares odds between groups

Technical Questions

  1. Q: Why use maximum likelihood instead of least squares?

    A:

    • MLE is more appropriate for binary outcomes
    • Provides better statistical properties
    • Naturally handles the probabilistic nature
    • Least squares can give invalid probabilities
  2. Q: How do you handle imbalanced datasets in logistic regression? (see the sketch after this list)

    A:

    • Resampling: SMOTE, undersampling, oversampling
    • Class weights: Inverse proportion weighting
    • Threshold tuning: Optimize for F1 or other metrics
    • Cost-sensitive learning: Different penalties for each class
  3. Q: What is complete separation and how do you handle it?

    A:

    • Perfect linear separation of classes
    • Causes infinite coefficient estimates
    • Solutions: Regularization, remove separating features, collect more data
  4. Q: Compare different regularization techniques for logistic regression.

    A:

    • L1 (Lasso): Sparse solutions, automatic feature selection
    • L2 (Ridge): Shrinks coefficients, handles multicollinearity
    • Elastic Net: Combines L1 and L2 benefits
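
For the imbalanced-data question above, a minimal scikit-learn sketch using class weights; X_train and y_train are placeholders, and the explicit weights in the second model are illustrative:

from sklearn.linear_model import LogisticRegression

# Weight each class inversely to its frequency in y_train
weighted_model = LogisticRegression(class_weight='balanced', max_iter=1000)
weighted_model.fit(X_train, y_train)

# Equivalent explicit weighting, e.g. penalize mistakes on the rare class 1 ten times more
custom_model = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)
custom_model.fit(X_train, y_train)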

Coding Questions

  1. Q: Implement logistic regression from scratch
  2. Q: How would you tune the decision threshold? (see the sketch after this list)
  3. Q: Implement cross-validation for logistic regression
  4. Q: How do you handle categorical features?
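
A minimal sketch for question 2: tune the threshold on a validation set to maximize F1 (the threshold grid and the choice of F1 are illustrative):

import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(y_val, p_val, thresholds=np.linspace(0.05, 0.95, 19)):
    """Pick the probability cutoff that maximizes F1 on a validation set."""
    scores = [f1_score(y_val, (p_val >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# Usage: p_val = model.predict_proba(X_val)[:, 1]; tau, best_f1 = tune_threshold(y_val, p_val)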

Performance Questions

  1. Q: When would you choose logistic regression over other classifiers?

    A:

    • Need probabilistic outputs
    • Interpretability is crucial
    • Linear decision boundary is sufficient
    • Fast training/prediction required
    • Baseline model for comparison
  2. Q: How do you evaluate logistic regression performance?

    A:

    • Balanced data: Accuracy, F1-score
    • Imbalanced data: Precision, Recall, AUC-PR
    • Probabilistic: Log-loss, Brier score, AUC-ROC
    • Statistical: Likelihood ratio test, pseudo R²

Practical Considerations

Data Preprocessing

1. Feature Scaling

2. Handling Categorical Variables

3. Feature Engineering

4. Missing Value Treatment

Model Selection and Tuning

1. Regularization Parameter Selection

2. Feature Selection

3. Threshold Optimization

Diagnostics and Validation

1. Model Assumptions

2. Goodness of Fit

3. Cross-Validation Strategies

Deployment Considerations

1. Model Monitoring

2. Interpretability

3. Scalability

Code Examples

Python Implementation from Scratch

Implementation Available: src/ml-algos/02_logistic_regression/logistic_regression_from_scratch.py

This comprehensive implementation includes:

Key features:

from src.ml_algos.logistic_regression.logistic_regression_from_scratch import LogisticRegressionScratch

# Initialize with different optimizers
model_gd = LogisticRegressionScratch(optimizer='gradient_descent')
model_nr = LogisticRegressionScratch(optimizer='newton_raphson')

# Train and compare
model_gd.fit(X_train, y_train)
results = model_gd.compare_optimizers(X_test, y_test)

Regularized Logistic Regression

Implementation Available: src/ml-algos/02_logistic_regression/regularized_logistic_regression.py

This module provides:

Usage example:

from src.ml_algos.logistic_regression.regularized_logistic_regression import (
    RegularizedLogisticRegression,
    RegularizationPathAnalyzer,  # assumed to live in the same module
)

# L1 regularization with feature selection
model_l1 = RegularizedLogisticRegression(penalty='l1', alpha=0.01)
model_l1.fit(X_train, y_train)

# Analyze regularization path
path_analyzer = RegularizationPathAnalyzer()
best_alpha = path_analyzer.find_optimal_alpha(X_train, y_train)

Advanced Topics

Implementation Available: src/ml-algos/02_logistic_regression/advanced_logistic_regression.py

Advanced implementations include:

Key classes:

from src.ml_algos.logistic_regression.advanced_logistic_regression import (
    MultinomialLogisticRegression,
    OrdinalLogisticRegression, 
    ImbalancedLogisticRegression
)

# Multiclass classification
multinomial_model = MultinomialLogisticRegression()
multinomial_model.fit(X_train, y_train_multiclass)

# Handle imbalanced data
imbalanced_model = ImbalancedLogisticRegression(strategy='smote')
imbalanced_model.fit(X_train_imbalanced, y_train_imbalanced)

Evaluation and Diagnostics

Implementation Available: src/ml-algos/02_logistic_regression/evaluation_suite.py

Comprehensive evaluation framework featuring:

Key components:

from sklearn.linear_model import LogisticRegression  # assuming the scikit-learn estimator is meant below
from src.ml_algos.logistic_regression.evaluation_suite import (
    LogisticRegressionEvaluator,
    CrossValidationEvaluator,
    BootstrapEvaluator
)

# Comprehensive evaluation
evaluator = LogisticRegressionEvaluator(model)
results = evaluator.evaluate_model(X_test, y_test)
evaluator.plot_evaluation_dashboard()

# Cross-validation analysis
cv_evaluator = CrossValidationEvaluator(LogisticRegression, cv_folds=5)
cv_results = cv_evaluator.cross_validate_model(X, y)

Visualization Tools

Implementation Available: src/ml-algos/02_logistic_regression/visualization_suite.py

Comprehensive visualization toolkit including:

Usage example:

from src.ml_algos.logistic_regression.visualization_suite import (
    LogisticRegressionVisualizer,
    create_comprehensive_visualization_dashboard,  # assumed to live in the same module
)

# Create visualizer
visualizer = LogisticRegressionVisualizer(model, feature_names)

# Generate comprehensive dashboard
figures = create_comprehensive_visualization_dashboard(
    model, X_train, y_train, X_test, y_test, feature_names
)

# Individual plot types
visualizer.plot_decision_boundary_2d(X, y)
visualizer.plot_coefficient_importance()
visualizer.plot_learning_curves(X, y)