
Linear Regression: A Comprehensive Guide

Table of Contents

  1. Introduction
  2. Mathematical Foundation
  3. Types of Linear Regression
  4. Algorithm Implementation
  5. Assumptions
  6. Evaluation Metrics
  7. Advantages and Disadvantages
  8. Common Interview Questions
  9. Practical Considerations
  10. Code Examples

Introduction

Linear Regression is one of the most fundamental and widely used algorithms in machine learning and statistics. It models the relationship between a dependent variable (target) and one or more independent variables (features) using a linear equation. Despite its simplicity, it serves as the foundation for understanding more complex algorithms and remains highly effective for many real-world problems.

Key Concepts:

  • Dependent variable (target): the quantity the model predicts
  • Independent variables (features): the inputs used to make the prediction
  • Coefficients (parameters): the weights that define the fitted linear relationship
  • Error term: the part of the target the linear model does not explain

Mathematical Foundation

Simple Linear Regression

For a single feature, the linear regression model is:

\[y = \beta_0 + \beta_1 x + \epsilon\]

Where:

  • \(y\) is the dependent variable (target)
  • \(x\) is the independent variable (feature)
  • \(\beta_0\) is the intercept (bias term)
  • \(\beta_1\) is the slope (coefficient) of \(x\)
  • \(\epsilon\) is the error term

Multiple Linear Regression

For multiple features, the model extends to:

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon\]

In matrix form:

\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\]

Where:

  • \(\mathbf{y}\) is the \(m \times 1\) vector of targets
  • \(\mathbf{X}\) is the \(m \times (n+1)\) design matrix (a column of ones for the intercept plus the \(n\) features)
  • \(\boldsymbol{\beta}\) is the \((n+1) \times 1\) vector of parameters
  • \(\boldsymbol{\epsilon}\) is the \(m \times 1\) vector of error terms

Cost Function

The goal is to minimize the Mean Squared Error (MSE):

\[J(\boldsymbol{\beta}) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\boldsymbol{\beta}}(x^{(i)}) - y^{(i)})^2\]

Where:

  • \(m\) is the number of training examples
  • \(h_{\boldsymbol{\beta}}(x^{(i)})\) is the model's prediction for example \(i\)
  • \(y^{(i)}\) is the actual target for example \(i\)

In matrix form:

\[J(\boldsymbol{\beta}) = \frac{1}{2m} (\mathbf{X}\boldsymbol{\beta} - \mathbf{y})^T (\mathbf{X}\boldsymbol{\beta} - \mathbf{y})\]
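As a quick illustration, here is a minimal NumPy sketch of this cost function (the tiny dataset below is made up for the example):

```python
import numpy as np

def mse_cost(X, y, beta):
    """J(beta) = (1/2m) * ||X beta - y||^2, with X already containing the bias column."""
    m = len(y)
    residuals = X @ beta - y
    return (residuals @ residuals) / (2 * m)

# Tiny made-up example: bias column plus one feature, perfectly linear target
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(mse_cost(X, y, beta=np.array([0.0, 2.0])))  # 0.0, since this beta fits exactly
```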

Analytical Solution (Normal Equation)

The optimal parameters can be found analytically:

\[\boldsymbol{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

Conditions for existence:

  • \(\mathbf{X}^T\mathbf{X}\) must be invertible, i.e. \(\mathbf{X}\) must have full column rank (no perfectly collinear features and at least as many observations as parameters)
  • If \(\mathbf{X}^T\mathbf{X}\) is singular, use the pseudo-inverse or add regularization (see the Normal Equation algorithm below)
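A minimal NumPy sketch of the closed-form solution (the synthetic data and true coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # bias column + 2 features
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=100)

# Solve (X^T X) beta = X^T y directly instead of forming the inverse explicitly
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # approximately [1.0, 2.0, -0.5]
```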

Types of Linear Regression

1. Ordinary Least Squares (OLS)

The standard form: it minimizes the plain MSE cost function above, with no penalty on the coefficients.

2. Ridge Regression (L2 Regularization)

Cost function with L2 penalty:

\[J(\boldsymbol{\beta}) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\boldsymbol{\beta}}(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} \beta_j^2\]

Where $\lambda$ is the regularization parameter.
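Ridge also has a closed-form solution: for the equivalent scaling that minimizes \(\|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|^2 + \lambda \|\boldsymbol{\beta}\|^2\), the optimum is \(\boldsymbol{\beta} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\). A minimal NumPy sketch on made-up synthetic data (for simplicity it also penalizes the intercept, which in practice is usually left unpenalized):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge for the cost ||X beta - y||^2 + lam * ||beta||^2."""
    n_params = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_params), X.T @ y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])  # bias column + 3 features
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=100)
print(ridge_fit(X, y, lam=1.0))  # coefficients slightly shrunk toward zero
```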

3. Lasso Regression (L1 Regularization)

Cost function with L1 penalty:

\[J(\boldsymbol{\beta}) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\boldsymbol{\beta}}(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^{n} |\beta_j|\]

4. Elastic Net

Combines L1 and L2 regularization:

\[J(\boldsymbol{\beta}) = \frac{1}{2m} \sum_{i=1}^{m} (h_{\boldsymbol{\beta}}(x^{(i)}) - y^{(i)})^2 + \lambda_1 \sum_{j=1}^{n} |\beta_j| + \lambda_2 \sum_{j=1}^{n} \beta_j^2\]
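Lasso has no closed-form solution (the L1 term is not differentiable at zero), so it is typically fit by coordinate descent. A minimal sketch using scikit-learn's Lasso and ElasticNet on synthetic data (the data, alpha, and l1_ratio values are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty only
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # l1_ratio mixes the L1 and L2 penalties

print("Lasso coefficients:      ", np.round(lasso.coef_, 2))  # irrelevant features typically end up exactly 0
print("Elastic Net coefficients:", np.round(enet.coef_, 2))
```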

Algorithm Implementation

Gradient Descent Algorithm

ALGORITHM: Linear Regression with Gradient Descent

INPUT: 
  - Training data: X (m × n), y (m × 1)
  - Learning rate: α
  - Number of iterations: max_iter
  - Convergence threshold: ε

OUTPUT: Optimal parameters β

PROCEDURE:
1. Initialize β randomly or to zeros
2. Add bias column to X (column of ones)
3. FOR iteration = 1 to max_iter:
   a. Compute predictions: ŷ = X × β
   b. Compute cost: J = (1/2m) × ||ŷ - y||²
   c. Compute gradients: ∇J = (1/m) × X^T × (ŷ - y)
   d. Update parameters: β = β - α × ∇J
   e. IF ||∇J|| < ε: BREAK (convergence)
4. RETURN β

COMPLEXITY:
- Time: O(max_iter × m × n)
- Space: O(m × n)
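A minimal Python sketch that follows this pseudocode (the synthetic data is made up for illustration; the learning rate and iteration count are arbitrary defaults):

```python
import numpy as np

def linear_regression_gd(X, y, alpha=0.05, max_iter=10_000, eps=1e-8):
    """Gradient descent as in the pseudocode above; X should not yet contain a bias column."""
    Xb = np.column_stack([np.ones(len(y)), X])   # step 2: add bias column
    beta = np.zeros(Xb.shape[1])                 # step 1: initialize to zeros
    m = len(y)
    for _ in range(max_iter):
        grad = Xb.T @ (Xb @ beta - y) / m        # step 3c: gradient of the MSE cost
        beta -= alpha * grad                     # step 3d: parameter update
        if np.linalg.norm(grad) < eps:           # step 3e: convergence check
            break
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200)
print(linear_regression_gd(X, y))  # approximately [1.0, 2.0, -1.0]
```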

Normal Equation Algorithm

ALGORITHM: Linear Regression with Normal Equation

INPUT: 
  - Training data: X (m × n), y (m × 1)

OUTPUT: Optimal parameters β

PROCEDURE:
1. Add bias column to X (column of ones)
2. Compute X^T × X
3. Check if X^T × X is invertible
4. IF invertible:
   a. Compute β = (X^T × X)^(-1) × X^T × y
5. ELSE:
   a. Use pseudo-inverse or regularization
6. RETURN β

COMPLEXITY:
- Time: O(m × n² + n³) (forming X^T X, then inverting it)
- Space: O(n²)
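A minimal Python sketch of this procedure (synthetic data made up for illustration; singularity is handled by falling back to the pseudo-inverse, as in step 5):

```python
import numpy as np

def linear_regression_normal_eq(X, y):
    """Normal-equation solver following the pseudocode above."""
    Xb = np.column_stack([np.ones(len(y)), X])   # step 1: add bias column
    XtX = Xb.T @ Xb                              # step 2
    try:
        beta = np.linalg.solve(XtX, Xb.T @ y)    # step 4: invertible case
    except np.linalg.LinAlgError:
        beta = np.linalg.pinv(Xb) @ y            # step 5: pseudo-inverse fallback
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200)
print(linear_regression_normal_eq(X, y))  # approximately [1.0, 2.0, -1.0]
```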

Assumptions

Linear regression relies on several critical assumptions that must be satisfied for the model to produce reliable and valid results. Violating these assumptions can lead to biased estimates, incorrect confidence intervals, and poor predictive performance.

1. Linearity

Definition: The relationship between the independent variables (features) and the dependent variable (target) is linear. This means the change in the target variable is proportional to the change in the feature variables.

Mathematical Expression: \(E[Y|X] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p\)

Detailed Explanation:

Consequences of Violation:

Detection Methods:
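The usual quick check is a residuals-vs-fitted plot. A minimal sketch on deliberately non-linear synthetic data (scikit-learn and matplotlib assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.5, size=200)  # deliberately non-linear target

model = LinearRegression().fit(x.reshape(-1, 1), y)
fitted = model.predict(x.reshape(-1, 1))
residuals = y - fitted

plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("A curved residual pattern suggests non-linearity")
plt.show()
```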

Solutions when violated:

2. Independence of Observations

Definition: Each observation in the dataset is independent of all other observations. The residual of one observation should not be predictable from the residuals of other observations.

Mathematical Expression: \(Cov(\epsilon_i, \epsilon_j) = 0 \text{ for all } i \neq j\)

Detailed Explanation:

Consequences of Violation:

Common Violations:

Detection Methods:
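A common quick check is the Durbin-Watson statistic on the residuals. A minimal sketch using statsmodels on synthetic, independently generated data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=200)  # independently generated errors

resid = sm.OLS(y, sm.add_constant(X)).fit().resid
print("Durbin-Watson:", durbin_watson(resid))  # values near 2 suggest no first-order autocorrelation
```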

Solutions when violated:

3. Homoscedasticity (Constant Variance)

Definition: The variance of the residuals is constant across all levels of the independent variables. The spread of residuals should be the same regardless of the predicted values.

Mathematical Expression: \(Var(\epsilon_i) = \sigma^2 \text{ for all } i\)

Detailed Explanation:

Consequences of Violation (Heteroscedasticity):

Detection Methods:
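Besides eyeballing the residual spread, the Breusch-Pagan test is a common formal check. A minimal statsmodels sketch on synthetic (homoscedastic) data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=200)  # constant-variance noise

Xc = sm.add_constant(X)
resid = sm.OLS(y, Xc).fit().resid
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(resid, Xc)
print("Breusch-Pagan p-value:", lm_pvalue)  # a small p-value is evidence of heteroscedasticity
```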

Solutions when violated:

4. Normality of Residuals

Definition: The residuals (error terms) are normally distributed with mean zero. This assumption is crucial for hypothesis testing and confidence intervals.

Mathematical Expression: \(\epsilon_i \sim N(0, \sigma^2)\)

Detailed Explanation:

Consequences of Violation:

Detection Methods:
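Q-Q plots and the Shapiro-Wilk test are common checks here. A minimal sketch using scipy and statsmodels on synthetic data:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=200)

resid = sm.OLS(y, sm.add_constant(X)).fit().resid
stat, p_value = stats.shapiro(resid)
print("Shapiro-Wilk p-value:", p_value)  # a small p-value suggests non-normal residuals

stats.probplot(resid, dist="norm", plot=plt)  # Q-Q plot: points should lie near the reference line
plt.show()
```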

Solutions when violated:

5. No Multicollinearity

Definition: The independent variables are not highly correlated with each other. Each feature should provide unique information about the target variable.

Mathematical Expression: Perfect multicollinearity: $X_j = \alpha + \beta X_k$ for some features j and k

Detailed Explanation:

Consequences of Violation:

Types of Multicollinearity:

Detection Methods:
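Variance Inflation Factors (VIF) are the standard numeric check; values above roughly 10 are usually flagged. A minimal statsmodels sketch on synthetic data with two nearly collinear features:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=200)  # nearly collinear with x1 on purpose
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, col in enumerate(X.columns):
    if col != "const":
        print(col, "VIF =", round(variance_inflation_factor(X.values, i), 1))  # VIF > 10 is a red flag
```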

Solutions when violated:

Assumption Validation with Python Examples

Here’s comprehensive Python code to check all assumptions with visualizations:

Linear Regression Assumptions Validation Code

This comprehensive code provides:

  1. Visual diagnostics for all five assumptions
  2. Statistical tests with interpretation guidelines
  3. Example datasets showing different violation scenarios
  4. Color-coded warnings for problematic values
  5. Summary statistics with actionable insights

The generated graphs below show how each scenario manifests in the diagnostic plots:

01. Well-behaved Data

02. Heteroscedasticity Violation

03. Non-linearity Violation

Quick Reference for Assumption Checking

| Assumption | Quick Check | Warning Signs | Solutions |
|---|---|---|---|
| Linearity | Residual plots | Curved patterns | Polynomial features, transformations |
| Independence | Durbin-Watson test | DW far from 2.0 | GLS, robust SE, time series models |
| Homoscedasticity | Residual spread | Funnel patterns | WLS, robust SE, transformations |
| Normality | Q-Q plots | Skewed residuals | Transformations, robust methods |
| No Multicollinearity | VIF values | VIF > 10 | Remove features, PCA, Ridge regression |

Evaluation Metrics

1. Mean Squared Error (MSE)

\(MSE = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2\)

2. Root Mean Squared Error (RMSE)

\(RMSE = \sqrt{MSE}\)

3. Mean Absolute Error (MAE)

\(MAE = \frac{1}{m} \sum_{i=1}^{m} |y_i - \hat{y}_i|\)

4. R-squared (Coefficient of Determination)

\(R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}\)

5. Adjusted R-squared

\(R^2_{adj} = 1 - \frac{(1-R^2)(m-1)}{m-p-1}\)

Where $p$ is the number of features.
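A minimal sketch computing these metrics with scikit-learn and NumPy (the toy predictions and the feature count p are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

m, p = len(y_true), 2  # p: hypothetical number of features used by the model
adj_r2 = 1 - (1 - r2) * (m - 1) / (m - p - 1)

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}  adjusted R2={adj_r2:.3f}")
```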

Advantages and Disadvantages

Advantages

  • Simple to implement and highly interpretable (each coefficient has a direct meaning)
  • Fast to train, with a closed-form solution for small to medium problems
  • A strong baseline and the foundation for many more complex models

Disadvantages

  • Assumes a linear relationship between features and target
  • Sensitive to outliers and to multicollinearity among features
  • Prone to underfitting when the true relationship is complex or highly non-linear

Common Interview Questions

Technical Questions

  1. Q: What is the difference between Ridge and Lasso regression?
    • Ridge: Uses L2 regularization, shrinks coefficients toward zero but doesn’t set them to exactly zero
    • Lasso: Uses L1 regularization, can set coefficients to exactly zero (feature selection)
  2. Q: When would you use gradient descent vs. normal equation?
    • Gradient Descent: Large datasets (m > 10,000), when X^TX is not invertible
    • Normal Equation: Small to medium datasets, when analytical solution is feasible
  3. Q: How do you handle multicollinearity?
    • Remove highly correlated features
    • Use regularization (Ridge/Lasso)
    • Principal Component Analysis (PCA)
    • Variance Inflation Factor (VIF) analysis
  4. Q: What if linear regression assumptions are violated?
    • Non-linearity: Polynomial features, kernel methods
    • Heteroscedasticity: Weighted least squares, robust standard errors
    • Non-normality: Robust regression methods
    • Autocorrelation: Generalized least squares

Coding Questions

  1. Q: Implement linear regression from scratch
  2. Q: How would you detect and handle outliers?
  3. Q: Implement cross-validation for model selection

Practical Considerations

Feature Engineering

  • Standardize or scale features (important for gradient descent and for regularized models)
  • Add polynomial or interaction terms when the relationship is not purely linear
  • Encode categorical variables (e.g., one-hot encoding) before fitting

Model Selection

  • Use cross-validation to compare models and to tune the regularization strength
  • Prefer adjusted R² over plain R² when comparing models with different numbers of features

Diagnostics

  • Inspect residual plots, Q-Q plots, and outlier/influence measures after fitting
  • Check VIF values for multicollinearity

Code Examples

Python Implementation from Scratch

Linear Regression from Scratch Code

Regularized Linear Regression

Ridge Regression from Scratch Code
