Regression Analysis and Modeling
Overview of Regression
Regression is a supervised machine learning technique used to model the relationship between a continuous target variable and one or more explanatory features. The primary goal is to estimate a continuous value, such as predicting a car's CO2 emissions based on engine size or forecasting sales figures. Classical statistical methods include linear and polynomial regression, whilst modern algorithms also include random forests, XGBoost, and neural networks.
Linear Regression Models
Linear regression models the relationship between independent variables and a dependent variable as a straight-line (linear) function. In its simplest form, it uses a single independent variable, such as predicting CO2 emissions based on engine size.
- A scatter plot can illustrate the correlation between the independent variable (engine size) and the dependent variable (CO2 emissions).
- Simple Linear Regression: This involves a single independent variable estimating a dependent variable. It fits a straight line through the data using the equation ŷ = θ₀ + θ₁x₁, where θ₀ is the intercept (bias) and θ₁ is the slope. It is easy to interpret and fast to calculate but can be overly simplistic for complex data and sensitive to outliers.
- Multiple Linear Regression: This extends the simple model to include two or more independent variables. The resulting model describes a plane (two features) or a hyperplane (more than two features).
- Categorical Data: Non-numerical variables must be converted; binary variables (like car type) become 0 or 1, while variables with multiple classes are transformed into separate Boolean features.
- Collinearity: A major pitfall occurs when independent variables are correlated with each other (collinear), making them no longer independent. This complicates "what-if" scenarios where one variable is changed while holding others constant.
- Performance Metric (MSE): The accuracy of linear models is typically evaluated using the Mean Squared Error (MSE). The model aims to minimise the average of the squared residual errors (the vertical distance between actual data points and the predicted line). The method used to find the parameters that minimise this error is often called Ordinary Least Squares (OLS).
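The OLS fit and the MSE described above can be sketched directly with NumPy. The engine-size and CO2 figures below are invented purely for illustration:

```python
import numpy as np

# Synthetic example: engine size (litres) vs CO2 emissions (g/km).
# The numbers are made up purely for illustration.
engine_size = np.array([1.0, 1.6, 2.0, 2.4, 3.0, 3.5])
co2 = np.array([120.0, 150.0, 180.0, 200.0, 240.0, 270.0])

# Ordinary Least Squares closed-form solution for y = theta0 + theta1 * x.
X = np.column_stack([np.ones_like(engine_size), engine_size])  # bias column
theta, *_ = np.linalg.lstsq(X, co2, rcond=None)
theta0, theta1 = theta

# Mean Squared Error: the average of the squared residuals.
predictions = theta0 + theta1 * engine_size
mse = np.mean((co2 - predictions) ** 2)
print(f"intercept={theta0:.2f}, slope={theta1:.2f}, MSE={mse:.2f}")
```

Because OLS has a closed-form solution, no iteration is needed here; `lstsq` computes the minimising parameters directly.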
Nonlinear and Polynomial Regression
When data follows a complex trend (e.g., a smoothed curve rather than a straight line), linear models may underfit the data. Nonlinear regression models the relationship between a dependent variable and one or more independent variables using nonlinear equations, such as polynomial, exponential, or logarithmic functions.
- Nonlinear Regression: This models relationships using non-linear equations such as exponential, logarithmic, or sinusoidal functions. It is essential for capturing phenomena like exponential growth (e.g., GDP over time) or the law of diminishing returns (e.g., productivity levelling off after long hours).
- Polynomial Regression: This is a specific form of non-linear regression where the data is fitted to polynomial expressions of the features (e.g., ŷ = θ₀ + θ₁x + θ₂x² + θ₃x³). Interestingly, because the model is a linear combination of these higher-power features, it can still be solved using linear regression techniques.
- Overfitting Risk: A polynomial of a sufficiently high degree can pass through every data point, capturing random noise rather than the underlying trend. This "memorisation" of training data is known as overfitting.
- Applications: Regression can capture various real-world phenomena, such as exponential growth (e.g., GDP), logarithmic relationships (e.g., diminishing returns), and periodic patterns (e.g., seasonal variations).
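The point that polynomial regression is "still linear" can be seen with a quick fit: the curve below is quadratic in x, but the parameters enter linearly, so an ordinary least-squares routine recovers them. The data is synthetic, generated from assumed coefficients:

```python
import numpy as np

# Synthetic curved data: y = 2 + 3x - 0.5x^2 plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
y = 2 + 3 * x - 0.5 * x**2 + rng.normal(0, 0.2, x.size)

# A degree-2 polynomial fit is just linear regression on [1, x, x^2].
coeffs = np.polyfit(x, y, deg=2)  # returned highest power first
print(coeffs)
```

Raising `deg` far beyond what the data warrants is exactly the overfitting risk noted above: the fit chases the noise term rather than the underlying quadratic.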
Logistic Regression
Despite the name, logistic regression is a binary classifier used to predict the probability that an observation belongs to a specific class (e.g., True/False, 0/1).
- The Sigmoid Function: Linear regression is unsuitable here because its predictions can range from negative infinity to positive infinity. Logistic regression solves this by using the sigmoid (logit) function, which compresses any input value into a range between 0 and 1.
- Decision Boundary: The output represents a probability. A threshold (typically 0.5) acts as a decision boundary; if the predicted probability is above this, the observation is classified as 1 (e.g., "Yes"), and if below, it is classified as 0 (e.g., "No").
- Log Loss (Cost Function): Unlike linear regression which minimises MSE, logistic regression minimises Log Loss. This function penalises confident but incorrect predictions heavily (e.g., predicting a 99% probability of an event that does not occur).
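The sigmoid's squashing behaviour and log loss's asymmetric penalty can both be demonstrated in a few lines (a minimal sketch, not scikit-learn's implementation):

```python
import numpy as np

def sigmoid(z):
    """Compress any real-valued input into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p):
    """Average log loss; penalises confident wrong predictions heavily."""
    eps = 1e-15
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# A confident correct prediction costs very little...
print(log_loss(np.array([1]), np.array([0.99])))  # ~0.01
# ...while the same confidence on a wrong prediction costs a lot.
print(log_loss(np.array([0]), np.array([0.99])))  # ~4.6
```

Note that sigmoid(0) = 0.5, which is why a raw score of zero sits exactly on the usual decision boundary.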
Training and Optimisation Algorithms
Finding the best model parameters (the intercept and coefficients) is the goal of training. Two common approaches are:
- Ordinary Least Squares (OLS): Used primarily for linear regression, this uses linear algebra to calculate the optimal values directly, without iteration. It is fast for small datasets but becomes computationally expensive as datasets grow large.
- Gradient Descent: An iterative optimisation algorithm used when an analytical solution is difficult. It starts with random parameters and adjusts them by moving in the direction of the steepest descent of the cost function (minimising error).
- Learning Rate: This factor controls the size of the steps taken during optimisation. Steps that are too big may miss the minimum, while steps that are too small make convergence slow.
- Stochastic Gradient Descent (SGD): A variation of gradient descent that uses a random subset of data rather than the whole dataset for each step. It is faster and scales better for large datasets, though it may wander around the global minimum rather than settling instantly.
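The gradient-descent loop above can be sketched from scratch for simple linear regression. The data, learning rate, and iteration count below are illustrative assumptions:

```python
import numpy as np

# Toy data with a known relationship: y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

theta0, theta1 = 0.0, 0.0   # start from arbitrary parameters
learning_rate = 0.02        # too big overshoots; too small converges slowly
n = x.size

for _ in range(5000):
    predictions = theta0 + theta1 * x
    error = predictions - y
    # Gradients of the MSE cost with respect to each parameter.
    grad0 = (2.0 / n) * error.sum()
    grad1 = (2.0 / n) * (error * x).sum()
    # Step in the direction of steepest descent.
    theta0 -= learning_rate * grad0
    theta1 -= learning_rate * grad1

print(f"theta0={theta0:.3f}, theta1={theta1:.3f}")  # approaches 1 and 2
```

SGD would differ only in computing `grad0` and `grad1` from a random subset of the data at each step rather than the full arrays.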
Applications
Regression analysis is applied across various industries:
- Sales & Retail: Forecasting yearly sales based on leads or order history.
- Maintenance: Predicting when machinery will require maintenance to prevent failure.
- Healthcare: Predicting the spread of infectious diseases or the likelihood of a patient developing a condition like diabetes based on health metrics.
- Customer Retention: Logistic regression is widely used to predict "churn", the likelihood that a customer will cancel a subscription.
- Environmental: Estimating rainfall or the probability of wildfires.
Python Simple Linear Regression
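A minimal simple linear regression example using scikit-learn, assuming it is installed; the engine-size and CO2 figures are synthetic, for illustration only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# One feature per car: engine size in litres (2-D array, one column).
engine_size = np.array([[1.0], [1.6], [2.0], [2.4], [3.0], [3.5]])
co2 = np.array([120, 150, 180, 200, 240, 270])  # g/km

model = LinearRegression()
model.fit(engine_size, co2)

print("intercept:", model.intercept_)
print("slope:", model.coef_[0])
print("prediction for a 2.5 L engine:", model.predict([[2.5]])[0])
```

scikit-learn expects the feature matrix to be 2-D (samples × features), which is why each engine size is wrapped in its own list.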
Python Multiple Linear Regression
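Extending to multiple linear regression only changes the shape of the feature matrix: two columns here (engine size and cylinder count, both synthetic), giving one fitted coefficient per feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two features per car: engine size (litres) and number of cylinders.
# All values are invented, for illustration only.
X = np.array([
    [1.0, 4], [1.6, 4], [2.0, 4],
    [2.4, 6], [3.0, 6], [3.5, 8],
])
co2 = np.array([120, 150, 180, 200, 240, 270])

model = LinearRegression().fit(X, co2)
print("coefficients:", model.coef_)   # one weight per feature
print("intercept:", model.intercept_)
print("prediction:", model.predict([[2.2, 4]])[0])
```

Note that engine size and cylinder count are plausibly collinear, which is exactly the pitfall flagged earlier: the individual coefficients become hard to interpret even when predictions remain accurate.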
Python Logistic Regression
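A minimal logistic regression example in the churn setting described above. The dataset (weekly hours of product use versus whether the customer churned) is entirely synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy churn data: weekly hours of use -> churned (1) or stayed (0).
hours = np.array([[1], [2], [3], [4], [10], [12], [15], [20]])
churned = np.array([1, 1, 1, 1, 0, 0, 0, 0])

model = LogisticRegression().fit(hours, churned)

# predict_proba returns [P(class 0), P(class 1)] per observation.
probs = model.predict_proba([[2], [18]])
print("P(churn | 2 hours):", probs[0, 1])
print("P(churn | 18 hours):", probs[1, 1])
print("labels:", model.predict([[2], [18]]))  # thresholded at 0.5
```

`predict` applies the default 0.5 decision boundary to the sigmoid output, so a light user is labelled 1 (likely to churn) and a heavy user 0.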