πŸ“ Linear Regression

Almost never used directly in the world of Transformers and diffusion models, but without knowing it you won't get a single offer 😁

[Figure: regression illustration]

Linear Regression: A Gentle Deep Dive

Linear regression is one of the simplest yet most important supervised learning algorithms. It allows us to model the relationship between input features and a continuous target variable.

What is Linear Regression?

At its core, linear regression tries to fit a straight line (or hyperplane in higher dimensions) to our data points. The general equation is:

\( Y \approx X W + b \)

  • X β€” matrix of input features (samples Γ— features)
  • W β€” vector of weights
  • b β€” bias or intercept term

🎯 The goal: find \(W\) and \(b\) such that predictions \(\hat{Y}\) are as close as possible to the true \(Y\).
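
To get a feel for the shapes involved, here is a minimal NumPy sketch (the sizes are made up for illustration):

import numpy as np

# Hypothetical toy data: 5 samples, 3 features
X = np.random.randn(5, 3)   # (samples x features)
W = np.random.randn(3, 1)   # one weight per feature, as a column vector
b = 0.5                     # scalar intercept

Y_hat = X @ W + b           # predictions, shape (5, 1)
print(Y_hat.shape)          # (5, 1)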

Loss Function: Mean Squared Error

We measure how β€œwrong” our predictions are using the Mean Squared Error (MSE):

\[ \begin{aligned} L(W, b) &= \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 \\ &= \frac{1}{m} \sum_{i=1}^{m} \bigl(y_i - (X_i W + b)\bigr)^2 \end{aligned} \]

Where m is the number of samples. Minimizing this loss means finding \(W\) and \(b\) that best fit the data.
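
In NumPy the MSE is a one-liner; a small sketch with made-up numbers:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0])      # hypothetical targets
y_pred = np.array([2.5, 5.5, 6.0])      # hypothetical predictions
mse = np.mean((y_true - y_pred) ** 2)   # (1/m) * sum of squared errors
print(mse)                              # 0.5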

How Gradients are Calculated

To minimize the MSE, we use gradient descent. The gradient of the loss with respect to \(W\) is:

$$\frac{\partial L}{\partial W} = \frac{\partial}{\partial W} \frac{1}{m} \sum_{i=1}^{m} (y_i - (X_i W + b))^2$$

Applying the chain rule:

$$\frac{\partial L}{\partial W} = \frac{1}{m} \sum_{i=1}^{m} 2 (y_i - (X_i W + b)) \cdot (-X_i)$$

Pulling the constant \(-2\) out of the sum:

$$\frac{\partial L}{\partial W} = -\frac{2}{m} \sum_{i=1}^{m} (y_i - (X_i W + b)) X_i$$

The constant factor 2 only scales the gradient and can be absorbed into the learning rate, so it is conventionally dropped. Writing \(\hat{y}_i = X_i W + b\):

$$\frac{\partial L}{\partial W} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i) X_i$$
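
As a sanity check on the algebra, symbolic differentiation of the single-sample, single-feature loss gives the same expression (a minimal sketch, assuming sympy is available):

import sympy as sp

w, b, x, y = sp.symbols('w b x y')
L = (y - (x * w + b)) ** 2                 # squared error for one sample
dL_dw = sp.diff(L, w)
# dL_dw equals -2*x*(y - (x*w + b)), matching the derivation above
assert sp.simplify(dL_dw + 2 * x * (y - (x * w + b))) == 0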

Similarly, the partial derivative with respect to the bias \(b\):

$$\frac{\partial L}{\partial b} = \frac{\partial}{\partial b} \frac{1}{m} \sum_{i=1}^{m} (y_i - (X_i W + b))^2$$

Applying the chain rule:

$$\frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} 2 (y_i - (X_i W + b)) \cdot (-1)$$

Again dropping the constant factor 2 and writing \(\hat{y}_i = X_i W + b\):

$$\frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)$$
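
The same symbolic check works for the bias term (again a short sympy sketch):

import sympy as sp

w, b, x, y = sp.symbols('w b x y')
L = (y - (x * w + b)) ** 2
dL_db = sp.diff(L, b)
# dL_db equals -2*(y - (x*w + b)), matching the derivation above
assert sp.simplify(dL_db + 2 * (y - (x * w + b))) == 0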

🎯 Final gradients, in vectorized form:

\( \frac{\partial L}{\partial W} = \frac{1}{m} X^T (\hat{Y} - Y) \)

\( \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i) \)

These gradients tell us the direction in which to update \(W\) and \(b\) to reduce the loss (a quick numerical check is sketched right after this list). Conceptually:

  • dw: how each feature weight should change
  • db: how the intercept should change
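
One way to convince yourself the vectorized formulas are right is to compare them against a finite-difference derivative on random data. The sketch below uses made-up sizes and checks the gradient of the raw MSE, which keeps the factor 2 that the formulas above absorb into the learning rate:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # made-up data: 20 samples, 3 features
Y = rng.normal(size=(20, 1))
W = rng.normal(size=(3, 1))
b = 0.1
m = len(X)

def mse(W, b):
    return np.mean((X @ W + b - Y) ** 2)

# Analytic gradients of the raw MSE (factor 2 kept here)
error = X @ W + b - Y
dW = (2 / m) * X.T @ error
db = (2 / m) * np.sum(error)

# Finite-difference estimates for the first weight and for the bias
eps = 1e-6
W_up = W.copy(); W_up[0, 0] += eps
W_dn = W.copy(); W_dn[0, 0] -= eps
print(dW[0, 0], (mse(W_up, b) - mse(W_dn, b)) / (2 * eps))   # should agree closely
print(db,       (mse(W, b + eps) - mse(W, b - eps)) / (2 * eps))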

Optimization via Gradient Descent

We iteratively update parameters:

\( W := W - \alpha \frac{\partial L}{\partial W} \)
\( b := b - \alpha \frac{\partial L}{\partial b} \)

Here, \(\alpha\) is the learning rate β€” a small step size that controls how fast we move toward the minimum.

Python Implementation


import numpy as np

class LinearRegression:
    def __init__(self, n_dims, lr=0.01):
        # n_dims - number of features
        self.lr = lr
        self.w = np.random.randn(n_dims, 1)  # weight column vector, shape (n_dims, 1)
        self.b = 0.0                         # scalar intercept

    def fit(self, X, Y, n_epoch=1000):
        Y = Y.reshape(-1, 1)  # make Y a column vector, shape (m, 1)
        for epoch in range(n_epoch):
            dw, db = self.grad(X, Y)
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict(self, X):
        # X: (m, n_dims), self.w: (n_dims, 1) -> predictions of shape (m, 1)
        return np.dot(X, self.w) + self.b

    def grad(self, X, Y):
        m = X.shape[0]
        y_hat = self.predict(X)
        error = y_hat - Y                 # (m, 1)
        dw = (1/m) * np.dot(X.T, error)   # dL/dW = (1/m) X^T (y_hat - Y)
        db = (1/m) * np.sum(error)        # dL/db = mean of the errors
        return dw, db
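
A quick usage sketch on synthetic data (the generating coefficients are made up, and the class above is assumed to be defined in the same file):

import numpy as np

# Synthetic data generated from known parameters: y = 2*x1 - 3*x2 + 1 + noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
Y = X @ np.array([[2.0], [-3.0]]) + 1.0 + 0.1 * rng.normal(size=(200, 1))

model = LinearRegression(n_dims=2, lr=0.1)
model.fit(X, Y, n_epoch=1000)
print(model.w.ravel(), model.b)   # should land close to [2, -3] and 1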

Practical Use Cases

Linear regression is everywhere:

  • Predicting house prices based on features like area, rooms, location.
  • Forecasting sales, stock prices, or any continuous quantity.
  • In machine learning pipelines as a baseline model before moving to complex algorithms.
  • Understanding relationships between variables — the learned coefficients \(W\) describe how the prediction changes per unit of each feature (reading them as "feature importance" is only meaningful once features are on comparable scales); see the sketch below.
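
As an example of the baseline and coefficient-inspection use, here is a minimal sketch with scikit-learn's own LinearRegression (same name as the class above, but imported from sklearn.linear_model; the data is made up):

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up listings: [area in m^2, number of rooms] -> price in thousands
X = np.array([[50, 2], [80, 3], [120, 4], [65, 2]])
y = np.array([150, 240, 360, 190])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # one coefficient per feature, plus the intercept
print(model.predict([[100, 3]]))       # prediction for a new listing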

Even though it's simple, linear regression is the foundation for many advanced techniques.

Published on August 22, 2025 · Author: Vitaly