πŸ“ Linear Regression

Almost never used directly in the world of Transformers and diffusion models, but without knowing it you won't get a single offer 😁

[Figure: regression illustration]

Linear Regression: A Gentle Deep Dive

Linear regression is one of the simplest yet most important supervised learning algorithms. It allows us to model the relationship between input features and a continuous target variable.

What is Linear Regression?

At its core, linear regression tries to fit a straight line (or hyperplane in higher dimensions) to our data points. The general equation is:

\( Y \approx X W + b \)

  • X β€” matrix of input features (samples Γ— features)
  • W β€” vector of weights
  • b β€” bias or intercept term

🎯 The goal: find \(W\) and \(b\) such that predictions \(\hat{Y}\) are as close as possible to the true \(Y\).
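
To get a feel for the shapes involved, here is a minimal NumPy sketch (the sizes are made up for illustration):

import numpy as np

# Hypothetical toy data: 5 samples, 3 features
X = np.random.randn(5, 3)   # (samples x features)
W = np.random.randn(3, 1)   # one weight per feature, as a column vector
b = 0.5                     # scalar intercept

Y_hat = X @ W + b           # predictions, shape (5, 1)
print(Y_hat.shape)          # (5, 1)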

Loss Function: Mean Squared Error

We measure how β€œwrong” our predictions are using the Mean Squared Error (MSE):

\[ \begin{aligned} L(W, b) &= \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 \\ &= \frac{1}{m} \sum_{i=1}^{m} \bigl(y_i - (X_i W + b)\bigr)^2 \end{aligned} \]

Where m is the number of samples. Minimizing this loss means finding \(W\) and \(b\) that best fit the data.
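
In NumPy the MSE is a one-liner; a small sketch with made-up numbers:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0])      # hypothetical targets
y_pred = np.array([2.5, 5.5, 6.0])      # hypothetical predictions
mse = np.mean((y_true - y_pred) ** 2)   # (1/m) * sum of squared errors
print(mse)                              # 0.5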

How Gradients are Calculated

To minimize the MSE, we use gradient descent. The gradient of the loss with respect to \(W\) is:

$$\frac{\partial L}{\partial W} = \frac{\partial}{\partial W} \frac{1}{m} \sum_{i=1}^{m} (y_i - (X_i W + b))^2$$

Applying the chain rule:

$$\frac{\partial L}{\partial W} = \frac{1}{m} \sum_{i=1}^{m} 2 (y_i - (X_i W + b)) \cdot (-X_i)$$

Pulling the constant \(-2\) out of the sum:

$$\frac{\partial L}{\partial W} = -\frac{2}{m} \sum_{i=1}^{m} (y_i - (X_i W + b)) X_i$$

The constant factor 2 only scales the gradient and can be absorbed into the learning rate, so it is conventionally dropped. Writing \(\hat{y}_i = X_i W + b\):

$$\frac{\partial L}{\partial W} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i) X_i$$
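
As a sanity check on the algebra, symbolic differentiation of the single-sample, single-feature loss gives the same expression (a minimal sketch, assuming sympy is available):

import sympy as sp

w, b, x, y = sp.symbols('w b x y')
L = (y - (x * w + b)) ** 2                 # squared error for one sample
dL_dw = sp.diff(L, w)
# dL_dw equals -2*x*(y - (x*w + b)), matching the derivation above
assert sp.simplify(dL_dw + 2 * x * (y - (x * w + b))) == 0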

Similarly, the partial derivative with respect to the bias \(b\):

$$\frac{\partial L}{\partial b} = \frac{\partial}{\partial b} \frac{1}{m} \sum_{i=1}^{m} (y_i - (X_i W + b))^2$$

Applying the chain rule:

$$\frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} 2 (y_i - (X_i W + b)) \cdot (-1)$$

Again dropping the constant factor 2 and writing \(\hat{y}_i = X_i W + b\):

$$\frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)$$
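
The same symbolic check works for the bias term (again a short sympy sketch):

import sympy as sp

w, b, x, y = sp.symbols('w b x y')
L = (y - (x * w + b)) ** 2
dL_db = sp.diff(L, b)
# dL_db equals -2*(y - (x*w + b)), matching the derivation above
assert sp.simplify(dL_db + 2 * (y - (x * w + b))) == 0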

🎯 Final gradients, in vectorized form:

\( \frac{\partial L}{\partial W} = \frac{1}{m} X^T (\hat{Y} - Y) \)

\( \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i) \)

These gradients tell us the direction in which to update \(W\) and \(b\) to reduce the loss (a quick numerical check is sketched right after this list). Conceptually:

  • dw: how each feature weight should change
  • db: how the intercept should change
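
One way to convince yourself the vectorized formulas are right is to compare them against a finite-difference derivative on random data. The sketch below uses made-up sizes and checks the gradient of the raw MSE, which keeps the factor 2 that the formulas above absorb into the learning rate:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # made-up data: 20 samples, 3 features
Y = rng.normal(size=(20, 1))
W = rng.normal(size=(3, 1))
b = 0.1
m = len(X)

def mse(W, b):
    return np.mean((X @ W + b - Y) ** 2)

# Analytic gradients of the raw MSE (factor 2 kept here)
error = X @ W + b - Y
dW = (2 / m) * X.T @ error
db = (2 / m) * np.sum(error)

# Finite-difference estimates for the first weight and for the bias
eps = 1e-6
W_up = W.copy(); W_up[0, 0] += eps
W_dn = W.copy(); W_dn[0, 0] -= eps
print(dW[0, 0], (mse(W_up, b) - mse(W_dn, b)) / (2 * eps))   # should agree closely
print(db,       (mse(W, b + eps) - mse(W, b - eps)) / (2 * eps))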

Optimization via Gradient Descent

We iteratively update parameters:

\( W := W - \alpha \frac{\partial L}{\partial W} \)
\( b := b - \alpha \frac{\partial L}{\partial b} \)

Here, \(\alpha\) is the learning rate β€” a small step size that controls how fast we move toward the minimum.

Python Implementation


import numpy as np

class LinearRegression:
    def __init__(self, n_dims, lr=0.01):
        # n_dims - number of features
        self.lr = lr
        self.w = np.random.randn(n_dims, 1)  # weight column vector, shape (n_dims, 1)
        self.b = 0.0                         # scalar intercept

    def fit(self, X, Y, n_epoch=1000):
        Y = Y.reshape(-1, 1)  # make Y a column vector, shape (m, 1)
        for epoch in range(n_epoch):
            dw, db = self.grad(X, Y)
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict(self, X):
        # X: (m, n_dims), self.w: (n_dims, 1) -> predictions of shape (m, 1)
        return np.dot(X, self.w) + self.b

    def grad(self, X, Y):
        m = X.shape[0]
        y_hat = self.predict(X)
        error = y_hat - Y                 # (m, 1)
        dw = (1/m) * np.dot(X.T, error)   # dL/dW = (1/m) X^T (y_hat - Y)
        db = (1/m) * np.sum(error)        # dL/db = mean of the errors
        return dw, db
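
A quick usage sketch on synthetic data (the generating coefficients are made up, and the class above is assumed to be defined in the same file):

import numpy as np

# Synthetic data generated from known parameters: y = 2*x1 - 3*x2 + 1 + noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
Y = X @ np.array([[2.0], [-3.0]]) + 1.0 + 0.1 * rng.normal(size=(200, 1))

model = LinearRegression(n_dims=2, lr=0.1)
model.fit(X, Y, n_epoch=1000)
print(model.w.ravel(), model.b)   # should land close to [2, -3] and 1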

Practical Use Cases

Linear regression is everywhere:

  • Predicting house prices based on features like area, rooms, location.
  • Forecasting sales, stock prices, or any continuous quantity.
  • In machine learning pipelines as a baseline model before moving to complex algorithms.
  • Understanding relationships between variables — the learned coefficients \(W\) describe how the prediction changes per unit of each feature (reading them as "feature importance" is only meaningful once features are on comparable scales); see the sketch below.
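
As an example of the baseline and coefficient-inspection use, here is a minimal sketch with scikit-learn's own LinearRegression (same name as the class above, but imported from sklearn.linear_model; the data is made up):

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up listings: [area in m^2, number of rooms] -> price in thousands
X = np.array([[50, 2], [80, 3], [120, 4], [65, 2]])
y = np.array([150, 240, 360, 190])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # one coefficient per feature, plus the intercept
print(model.predict([[100, 3]]))       # prediction for a new listing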

Even though it's simple, linear regression is the foundation for many advanced techniques.

Published on August 22, 2025 · Author: Vitaly