Linear Regression from Scratch with NumPy

Build linear regression from the ground up using NumPy, understand the math behind gradient descent, and implement it without any ML library.


Understanding linear regression deeply means implementing it yourself. No scikit-learn, no frameworks — just NumPy and math.

The Hypothesis

Linear regression predicts a continuous output $y$ from inputs $\mathbf{x}$:

$$h_\theta(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots = \mathbf{x}^T \boldsymbol{\theta}$$

In vectorized form (where $x_0 = 1$ for the bias term):

$$\hat{\mathbf{y}} = X \boldsymbol{\theta}$$
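As a quick sanity check, the vectorized hypothesis is a single matrix–vector product once the bias column is prepended (the numbers here are illustrative):

```python
import numpy as np

# Three samples with two features each
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
theta = np.array([0.5, 1.0, -1.0])  # [bias, w1, w2]

# Prepend the x_0 = 1 bias column, then y_hat = X_b @ theta
X_b = np.c_[np.ones((X.shape[0], 1)), X]
y_hat = X_b @ theta
print(y_hat)  # [-0.5 -0.5 -0.5]
```

Each prediction is the dot product of one augmented row of `X_b` with `theta`, so no explicit loop over samples is needed.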

The Cost Function

Mean Squared Error (MSE) measures how wrong our predictions are:

$$J(\boldsymbol{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(\mathbf{x}^{(i)}) - y^{(i)})^2 = \frac{1}{2m} \|X\boldsymbol{\theta} - \mathbf{y}\|^2$$
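The cost is one line of NumPy given predictions and targets; a minimal worked example on a toy dataset:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
m = len(y_true)

# J = (1 / 2m) * sum of squared errors
J = (1 / (2 * m)) * np.sum((y_pred - y_true) ** 2)
print(J)  # (0.25 + 0.0 + 1.0) / 6 ≈ 0.2083
```

The factor of $\frac{1}{2}$ is a convention that cancels the 2 produced by differentiating the square, keeping the gradient formula clean.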

Gradient Descent

To minimize $J$, update parameters in the direction of the negative gradient:

$$\boldsymbol{\theta} := \boldsymbol{\theta} - \alpha \nabla_\theta J = \boldsymbol{\theta} - \frac{\alpha}{m} X^T (X\boldsymbol{\theta} - \mathbf{y})$$
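The analytic gradient $\frac{1}{m} X^T (X\boldsymbol{\theta} - \mathbf{y})$ can be verified against a finite-difference approximation, a useful habit whenever you derive a gradient by hand. A sketch using central differences on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X_b = np.c_[np.ones((5, 1)), rng.normal(size=(5, 2))]
y = rng.normal(size=5)
theta = rng.normal(size=3)
m = len(y)

def cost(t):
    e = X_b @ t - y
    return (1 / (2 * m)) * np.sum(e ** 2)

# Analytic gradient: (1/m) X^T (X theta - y)
grad = (1 / m) * X_b.T @ (X_b @ theta - y)

# Central finite differences, one coordinate at a time
eps = 1e-6
num_grad = np.array([
    (cost(theta + eps * np.eye(3)[j]) - cost(theta - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
print(np.allclose(grad, num_grad, atol=1e-6))  # True
```

If the two disagree, the bug is almost always in the analytic derivation rather than in the cost function.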

Implementation

Python
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple

class LinearRegression:
    def __init__(self, learning_rate: float = 0.01, epochs: int = 1000):
        self.lr = learning_rate
        self.epochs = epochs
        self.theta: np.ndarray | None = None
        self.loss_history: list[float] = []

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'LinearRegression':
        m, n = X.shape
        # Add bias column (x_0 = 1)
        X_b = np.c_[np.ones((m, 1)), X]
        # Initialize parameters to zero
        self.theta = np.zeros(n + 1)

        for epoch in range(self.epochs):
            predictions = X_b @ self.theta
            errors = predictions - y
            cost = (1 / (2 * m)) * np.sum(errors ** 2)
            self.loss_history.append(cost)
            gradient = (1 / m) * X_b.T @ errors
            self.theta -= self.lr * gradient

            if epoch % 100 == 0:
                print(f"Epoch {epoch:4d}  Loss: {cost:.6f}")

        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        if self.theta is None:
            raise RuntimeError("Model not trained yet")
        X_b = np.c_[np.ones((X.shape[0], 1)), X]
        return X_b @ self.theta

    def score(self, X: np.ndarray, y: np.ndarray) -> float:
        """R² coefficient of determination."""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1 - ss_res / ss_tot

Test on Synthetic Data

Python
# Generate data: y = 3x + 2 + noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, (200, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 1.5, 200)

# Train
model = LinearRegression(learning_rate=0.01, epochs=1000)
model.fit(X, y)

print(f"Intercept (θ₀): {model.theta[0]:.4f}")  # ≈ 2.0
print(f"Slope     (θ₁): {model.theta[1]:.4f}")  # ≈ 3.0
print(f"R²:             {model.score(X, y):.4f}")

Output:

Code
Intercept (θ₀): 2.0314
Slope     (θ₁): 2.9887
R²:             0.9873

The **Normal Equation** $\boldsymbol{\theta} = (X^T X)^{-1} X^T \mathbf{y}$ gives an exact closed-form solution without iteration — but it's $O(n^3)$ in the number of features. Gradient descent scales better to high-dimensional data.
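For comparison, the closed-form solution can be obtained with `np.linalg.lstsq`, which is preferred over forming the explicit inverse for numerical stability. This sketch regenerates the same synthetic data as above:

```python
import numpy as np

# Same synthetic data: y = 3x + 2 + noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, (200, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 1.5, 200)

# Solve min ||X_b theta - y||^2 in one call, no iteration
X_b = np.c_[np.ones((X.shape[0], 1)), X]
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta)  # intercept ≈ 2, slope ≈ 3
```

With well-converged gradient descent, both approaches should land on essentially the same parameters.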

Feature Scaling

Gradient descent converges much faster with normalized features:

Python
def normalize(X: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    mean = X.mean(axis=0)
    std  = X.std(axis=0) + 1e-8  # avoid division by zero
    return (X - mean) / std, mean, std
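One subtlety: the mean and standard deviation computed on the training data must be reused when transforming new inputs at prediction time, or train and test features end up on different scales. A sketch of that pattern, using a variant of the helper above extended with optional precomputed statistics:

```python
import numpy as np

def normalize(X, mean=None, std=None):
    # Fit statistics on training data; reuse them for new data
    if mean is None:
        mean = X.mean(axis=0)
        std = X.std(axis=0) + 1e-8  # avoid division by zero
    return (X - mean) / std, mean, std

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 1000, (100, 1))  # deliberately large scale
X_train_n, mu, sigma = normalize(X_train)

# New data is transformed with the *training* statistics
X_new_n, _, _ = normalize(rng.uniform(0, 1000, (10, 1)), mu, sigma)
print(X_train_n.mean(), X_train_n.std())  # ≈ 0 and ≈ 1
```

After normalization, every feature contributes on a comparable scale, so a single learning rate works well across all parameters.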

Linear regression underpins much of machine learning: the same pattern of a hypothesis, a cost function, and gradient-based optimization reappears in logistic regression, neural networks, and far more complex models.


webencher Editorial

Software engineers and technical writers with 10+ years of combined experience in algorithms, systems design, and web development. Every article is reviewed for accuracy, depth, and practical applicability.