Why Linear Regression Is More Profound Than You Think: A Journey Through Estimation Theory

Mohammadmahdi Maharebi
linear algebra · estimation theory · regression · pseudoinverse · kalman filter · machine learning

Linear regression is often the first algorithm encountered in machine learning, prized for its simplicity and interpretability. But this apparent simplicity is deceptive. Beneath the surface, linear regression serves as a gateway to a rich, interconnected world of geometric, probabilistic, and dynamic systems concepts.

This post revisits the classic least-squares problem to uncover the deeper mathematical structures at its core. We will journey through four distinct but convergent perspectives:

  1. Geometric Intuition: Viewing regression as a projection in vector spaces.
  2. Probabilistic Rigor: Framing it as a maximum likelihood estimation problem and invoking the Gauss-Markov theorem.
  3. Bayesian Inference: Understanding regularization as the incorporation of prior beliefs.
  4. Sequential Estimation: Evolving the batch solution into a recursive one suitable for real-time systems, leading us to the Kalman Filter and its surprising links to reinforcement learning.

1. The Geometric View: Regression as Projection

The most elegant and fundamental perspective on linear regression is geometric. We are given a set of input vectors $\boldsymbol{\phi}(x_i) \in \mathbb{R}^d$ and corresponding scalar outputs $y_i$. Our goal is to find a linear combination of the features that best approximates the outputs.

Let's assemble our data:

  • The design matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$, whose rows are the feature vectors $\boldsymbol{\phi}(x_i)^\top$.
  • The target vector $\mathbf{y} \in \mathbb{R}^N$.
  • The parameter vector $\mathbf{w} \in \mathbb{R}^d$.

Our model predicts $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$. The vector $\hat{\mathbf{y}}$ is, by definition, a linear combination of the columns of $\mathbf{X}$. This means $\hat{\mathbf{y}}$ must lie in the column space of $\mathbf{X}$, denoted $\mathcal{C}(\mathbf{X})$.

The ordinary least squares (OLS) problem seeks to find the parameter vector w\mathbf{w}^* that minimizes the squared Euclidean distance between the prediction and the true targets:

$$\mathbf{w}^* = \underset{\mathbf{w}}{\arg\min}\; \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$$

Geometrically, this is equivalent to finding the vector $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}^*$ in the column space $\mathcal{C}(\mathbf{X})$ that is closest to $\mathbf{y}$. This vector is the orthogonal projection of $\mathbf{y}$ onto $\mathcal{C}(\mathbf{X})$.

This geometric condition implies that the residual vector, $\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{w}^*$, must be orthogonal to every vector in $\mathcal{C}(\mathbf{X})$. This can be stated succinctly as:

$$\mathbf{X}^\top (\mathbf{y} - \mathbf{X}\mathbf{w}^*) = \mathbf{0}$$

Rearranging this gives the celebrated normal equations:

$$\mathbf{X}^\top \mathbf{X} \mathbf{w}^* = \mathbf{X}^\top \mathbf{y}$$

If the columns of $\mathbf{X}$ are linearly independent, the Gram matrix $\mathbf{X}^\top \mathbf{X}$ is invertible, and we obtain the unique solution:

$$\boxed{\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}}$$
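As a quick numerical sanity check (a minimal sketch with synthetic data, not part of the derivation), the following NumPy snippet solves the normal equations and verifies that the residual is orthogonal to the columns of $\mathbf{X}$; in practice `np.linalg.lstsq` (QR/SVD-based) is preferred over forming $\mathbf{X}^\top \mathbf{X}$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))                     # design matrix (synthetic data)
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)       # targets with additive noise

# Solve the normal equations X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# The residual must be orthogonal to the column space of X
residual = y - X @ w_star
print(np.allclose(X.T @ residual, 0.0, atol=1e-8))   # True (up to round-off)

# Numerically preferable: least squares via SVD/QR
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_star, w_lstsq))
```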

The Pseudoinverse and SVD

When $\mathbf{X}^\top \mathbf{X}$ is singular or ill-conditioned, we use the Moore-Penrose pseudoinverse $\mathbf{X}^+$. Via the singular value decomposition (SVD), write:

$$\mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top$$

where $\mathbf{U} \in \mathbb{R}^{N \times N}$ and $\mathbf{V} \in \mathbb{R}^{d \times d}$ are orthogonal, and $\mathbf{\Sigma} \in \mathbb{R}^{N \times d}$ contains the singular values $\sigma_i \geq 0$. The pseudoinverse is:

$$\mathbf{X}^+ = \mathbf{V} \mathbf{\Sigma}^+ \mathbf{U}^\top$$

where $\mathbf{\Sigma}^+$ replaces each nonzero $\sigma_i$ with $1/\sigma_i$ (and transposes the shape). The solution $\mathbf{w}^* = \mathbf{X}^+ \mathbf{y}$ exists for any $\mathbf{X}$ and gives the minimum-norm least-squares solution when the system is underdetermined.

Conditioning: The condition number $\kappa(\mathbf{X}^\top \mathbf{X}) = (\sigma_{\max}/\sigma_{\min})^2$ measures sensitivity to perturbations. Large $\kappa$ indicates ill-conditioning, amplifying noise and motivating regularization.
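A brief sketch of the SVD route on a synthetic, nearly rank-deficient design; `np.linalg.pinv` performs essentially the same construction, zeroing out singular values below a tolerance.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 4
X = rng.normal(size=(N, d))
X[:, 3] = X[:, 0] + 1e-9 * rng.normal(size=N)     # nearly dependent column -> ill-conditioned

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD: U (N x d), s (d,), Vt (d x d)
tol = max(X.shape) * np.finfo(float).eps * s.max()
s_inv = np.where(s > tol, 1.0 / s, 0.0)           # invert only the significant singular values

X_pinv = Vt.T @ np.diag(s_inv) @ U.T              # Moore-Penrose pseudoinverse
print(np.allclose(X_pinv, np.linalg.pinv(X)))     # matches NumPy's built-in (same construction)
print("cond(X^T X) ~", (s.max() / s.min()) ** 2)  # huge: the design is ill-conditioned
```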

The Hat Matrix and Leverage

The hat matrix (or projection matrix) is:

$$\mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top$$

It projects $\mathbf{y}$ onto $\mathcal{C}(\mathbf{X})$: $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$. The diagonal elements $h_{ii}$ are the leverage values, quantifying the influence of observation $i$ on its own prediction. High leverage indicates potential outliers or influential points. The residual $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}$ is orthogonal to $\mathcal{C}(\mathbf{X})$, as required.
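A short illustration of the hat matrix and leverage on made-up data (the variable names and numbers are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 30, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # intercept + one feature
X[0, 1] = 8.0                                          # an extreme x-value -> high leverage
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix (solve instead of an explicit inverse)
leverage = np.diag(H)

print(np.isclose(H.trace(), d))         # trace(H) = rank(X) = d
print(leverage.argmax() == 0)           # the extreme point has the largest leverage
print(np.allclose(H @ H, H))            # H is idempotent: a projection
```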

2. The Probabilistic View: Maximum Likelihood & The Gauss-Markov Theorem

The geometric view is deterministic. To introduce statistical properties, we model the data-generating process. A standard assumption is that the outputs are generated by a linear model with additive, zero-mean Gaussian noise:

$$y_i = \mathbf{w}^\top \boldsymbol{\phi}(x_i) + v_i, \qquad v_i \sim \mathcal{N}(0, \sigma^2)$$

In vector form, this is $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{v}$, where $\mathbf{v} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$.

From this, we can ask: what parameter vector $\mathbf{w}$ most likely generated the observed data $\mathbf{y}$? The likelihood function gives us the probability density of the data given the parameters:

$$p(\mathbf{y} \mid \mathbf{w}) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{w}^\top \boldsymbol{\phi}(x_i))^2}{2\sigma^2}\right) \propto \exp\left(-\frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2\right)$$

To find the Maximum Likelihood Estimator (MLE), we typically maximize the log-likelihood, which is equivalent and mathematically simpler:

$$\log p(\mathbf{y} \mid \mathbf{w}) = C - \frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$$

Maximizing this expression is equivalent to minimizing $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$. The solution is identical to the OLS estimator:

$$\boxed{\hat{\mathbf{w}}_{\mathrm{ML}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}}$$

This reveals that OLS is not merely a convenient heuristic; it is the statistically optimal estimator under the Gaussian noise assumption. This allows us to analyze its properties:

  • Unbiasedness: The estimator is correct on average: $\mathbb{E}[\hat{\mathbf{w}}_{\mathrm{ML}}] = \mathbf{w}$.
  • Covariance: The uncertainty in our estimate is $\mathrm{Cov}(\hat{\mathbf{w}}_{\mathrm{ML}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$.
  • Efficiency: The estimator achieves the Cramér-Rao Lower Bound, meaning it is the most precise unbiased estimator possible.

Crucially, the Gauss-Markov Theorem provides a weaker but more general guarantee: even if the noise is not Gaussian, as long as it is uncorrelated and has zero mean and constant variance (homoscedastic), the OLS estimator is the Best Linear Unbiased Estimator (BLUE). It has the minimum variance among all linear unbiased estimators.
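These properties are easy to check numerically. The sketch below (synthetic data, illustrative settings) repeats the noise draw many times and compares the empirical mean and covariance of $\hat{\mathbf{w}}_{\mathrm{ML}}$ against the theoretical values above.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, sigma = 200, 3, 0.5
X = rng.normal(size=(N, d))                 # fixed design across trials
w_true = np.array([1.0, -2.0, 0.5])
cov_theory = sigma**2 * np.linalg.inv(X.T @ X)

estimates = []
for _ in range(5000):                       # Monte Carlo over noise realizations
    y = X @ w_true + sigma * rng.normal(size=N)
    estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])
estimates = np.array(estimates)

print("empirical bias:", estimates.mean(axis=0) - w_true)     # ~ 0 (unbiased)
print("max covariance error:",
      np.abs(np.cov(estimates.T) - cov_theory).max())         # small: matches sigma^2 (X^T X)^{-1}
```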

Gauss-Markov Theorem: A Sketch

Setup: Consider any linear unbiased estimator of the form $\tilde{\mathbf{w}} = \mathbf{A}\mathbf{y}$ where $\mathbf{A} \in \mathbb{R}^{d \times N}$. For unbiasedness:

$$\mathbb{E}[\tilde{\mathbf{w}}] = \mathbf{A}\,\mathbb{E}[\mathbf{y}] = \mathbf{A}\mathbf{X}\mathbf{w} = \mathbf{w} \quad \forall\, \mathbf{w}$$

This requires $\mathbf{A}\mathbf{X} = \mathbf{I}$.

Variance: The covariance of $\tilde{\mathbf{w}}$ is:

$$\mathrm{Cov}(\tilde{\mathbf{w}}) = \mathbf{A}\, \mathrm{Cov}(\mathbf{y})\, \mathbf{A}^\top = \sigma^2 \mathbf{A}\mathbf{A}^\top$$

Optimality: Among all matrices $\mathbf{A}$ satisfying $\mathbf{A}\mathbf{X} = \mathbf{I}$, the choice $\mathbf{A}^* = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top$ minimizes the variance (in the sense of the Loewner ordering). This gives:

$$\mathrm{Cov}(\hat{\mathbf{w}}_{\mathrm{OLS}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$$

Thus, OLS is BLUE.

3. The Bayesian View: From Regularization to Priors

The MLE perspective assumes we know nothing about $\mathbf{w}$ beforehand. Bayesian inference allows us to incorporate prior beliefs. We treat $\mathbf{w}$ as a random variable with a prior distribution $p(\mathbf{w})$.

Let's assume a zero-mean Gaussian prior for $\mathbf{w}$, which encodes a belief that smaller parameter values are more likely:

$$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$$

Using Bayes' rule, the posterior distribution of $\mathbf{w}$ after observing the data is:

$$p(\mathbf{w} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y})} \propto p(\mathbf{y} \mid \mathbf{w})\, p(\mathbf{w})$$

The Maximum A Posteriori (MAP) estimate maximizes this posterior probability. Taking the negative log of the posterior gives:

$$-\log p(\mathbf{w} \mid \mathbf{y}) = \frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{1}{2} \mathbf{w}^\top \Sigma_p^{-1} \mathbf{w} + \text{const}$$

If we assume a simple spherical prior $\Sigma_p = \tau^2 \mathbf{I}$, the MAP estimate becomes the minimizer of:

$$L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{\sigma^2}{\tau^2} \|\mathbf{w}\|^2$$

The solution to this is:

$$\boxed{\hat{\mathbf{w}}_{\mathrm{MAP}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}} \quad \text{where} \quad \lambda = \frac{\sigma^2}{\tau^2}$$

This is exactly Ridge Regression. The regularization term, often seen as an ad-hoc penalty to prevent overfitting, is now revealed to be the consequence of a Gaussian prior on the parameters.
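A minimal sketch of the closed-form ridge/MAP estimator on synthetic data, showing the shrinkage effect of $\lambda$ (all numbers are arbitrary choices for illustration):

```python
import numpy as np

def ridge(X, y, lam):
    """MAP / ridge estimate: (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 5))
y = X @ rng.normal(size=5) + 0.3 * rng.normal(size=40)

for lam in (0.0, 1.0, 10.0):
    w = ridge(X, y, lam)
    print(f"lambda={lam:5.1f}  ||w||={np.linalg.norm(w):.3f}")   # the norm shrinks as lambda grows
```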

Posterior Covariance: The full posterior distribution is also Gaussian, with covariance:

$$\Sigma_{\text{post}} = \left(\frac{1}{\sigma^2}\mathbf{X}^\top \mathbf{X} + \Sigma_p^{-1}\right)^{-1}$$

For the isotropic case, this becomes:

$$\Sigma_{\text{post}} = \sigma^2 (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1}$$

Notice that regularization reduces posterior uncertainty (smaller eigenvalues) at the cost of introducing bias.

Bias-Variance Decomposition

For a new test point $\mathbf{x}$, the prediction $\hat{y} = \mathbf{x}^\top \hat{\mathbf{w}}$ has an expected squared error that decomposes into three components:

$$\mathbb{E}[(y - \hat{y})^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$$

where:

  • Bias: $(\mathbf{x}^\top \mathbf{w} - \mathbb{E}[\mathbf{x}^\top \hat{\mathbf{w}}])^2$ measures systematic error.
  • Variance: $\mathbb{E}[(\mathbf{x}^\top \hat{\mathbf{w}} - \mathbb{E}[\mathbf{x}^\top \hat{\mathbf{w}}])^2]$ measures sensitivity to the training data.
  • Irreducible error: $\sigma^2$ from the noise.

For OLS, bias is zero but variance can be large (especially when $\kappa(\mathbf{X}^\top \mathbf{X})$ is large). Ridge regression introduces bias by shrinking coefficients, but reduces variance, often improving overall test error—the fundamental bias-variance tradeoff.
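A small, illustrative simulation of this tradeoff on a deliberately ill-conditioned synthetic design (the specific numbers are arbitrary; the point is the qualitative pattern of test error versus $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, sigma = 30, 10, 1.0
scales = np.diag(np.logspace(0, -3, d))          # rapidly decaying column scales -> ill-conditioned
X = rng.normal(size=(N, d)) @ scales
w_true = rng.normal(size=d)
X_test = rng.normal(size=(1000, d)) @ scales
y_test_clean = X_test @ w_true                   # noiseless targets: error = bias^2 + variance

def fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in (0.0, 0.1, 1.0):
    errs = []
    for _ in range(500):                         # average over training-noise realizations
        y = X @ w_true + sigma * rng.normal(size=N)
        errs.append(np.mean((X_test @ fit(X, y, lam) - y_test_clean) ** 2))
    print(f"lambda={lam:4.1f}  mean test MSE={np.mean(errs):.3f}")
```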

Other priors lead to different regularizers. For instance, a Laplacian prior $p(\mathbf{w}) \propto \exp(-\alpha \|\mathbf{w}\|_1)$ results in an $L_1$ penalty, leading to LASSO regression, which encourages sparse solutions.

4. The Information-Theoretic View

Linear regression connects deeply to information theory through the Fisher information matrix and mutual information in linear Gaussian channels.

Fisher Information and the Cramér-Rao Bound

The Fisher information matrix quantifies how much information the data $\mathbf{y}$ carries about the parameters $\mathbf{w}$. For the linear Gaussian model $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{v}$ with $\mathbf{v} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$, it is:

$$\mathcal{I}(\mathbf{w}) = \frac{1}{\sigma^2} \mathbf{X}^\top \mathbf{X}$$

The Cramér-Rao Lower Bound (CRLB) states that for any unbiased estimator $\hat{\mathbf{w}}$:

$$\mathrm{Cov}(\hat{\mathbf{w}}) \succeq \mathcal{I}(\mathbf{w})^{-1} = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$$

The MLE/OLS estimator achieves this bound exactly, making it efficient—no unbiased estimator can do better.

Mutual Information in Linear Gaussian Channels

Consider $\mathbf{w}$ as a random signal with prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_w)$, and the observation model $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{v}$. The mutual information between $\mathbf{w}$ and $\mathbf{y}$ is:

$$I(\mathbf{w}; \mathbf{y}) = \frac{1}{2}\log\det\left(\mathbf{I} + \frac{1}{\sigma^2}\mathbf{X}\Sigma_w \mathbf{X}^\top\right)$$

This quantifies how much observing $\mathbf{y}$ reduces uncertainty about $\mathbf{w}$. Via the SVD of $\mathbf{X}$, this depends on the singular values $\sigma_i$ and the prior variance along each singular direction. Regularization (small $\Sigma_w$) reduces mutual information but improves numerical stability and generalization.
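A compact numerical check of this formula for an isotropic prior $\Sigma_w = \tau^2 \mathbf{I}$, where the log-determinant reduces to a sum over the singular values of $\mathbf{X}$ (all quantities below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, sigma, tau = 50, 4, 1.0, 2.0
X = rng.normal(size=(N, d))

# Mutual information via the log-determinant (natural log -> nats)
M = np.eye(N) + (tau**2 / sigma**2) * (X @ X.T)
mi_logdet = 0.5 * np.linalg.slogdet(M)[1]

# Equivalent expression through the singular values of X
s = np.linalg.svd(X, compute_uv=False)
mi_svd = 0.5 * np.sum(np.log1p((tau**2 / sigma**2) * s**2))

print(mi_logdet, mi_svd)        # the two expressions agree
```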

5. Signal Estimation Theory Connections

Linear regression is a special case of broader signal estimation frameworks.

Generalized Least Squares (GLS)

When the noise is correlated with covariance $\mathrm{Cov}(\mathbf{v}) = \mathbf{\Sigma}_v \neq \sigma^2 \mathbf{I}$, OLS is no longer BLUE. The Generalized Least Squares (GLS) estimator is:

$$\hat{\mathbf{w}}_{\text{GLS}} = (\mathbf{X}^\top \mathbf{\Sigma}_v^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{\Sigma}_v^{-1} \mathbf{y}$$

GLS is BLUE for correlated noise, transforming the problem via $\mathbf{\Sigma}_v^{-1/2}$ to recover the OLS structure.
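A sketch of this whitening argument using a Cholesky factor $\mathbf{\Sigma}_v = \mathbf{L}\mathbf{L}^\top$ and a synthetic AR(1)-style noise covariance (the setup is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(7)
N, d, rho = 60, 3, 0.8
X = rng.normal(size=(N, d))
w_true = np.array([1.0, 0.5, -1.0])

# AR(1)-style noise covariance: Sigma_v[i, j] = rho^|i - j|
idx = np.arange(N)
Sigma_v = rho ** np.abs(idx[:, None] - idx[None, :])
L = np.linalg.cholesky(Sigma_v)
y = X @ w_true + L @ rng.normal(size=N)          # correlated noise sample

# Direct GLS formula
Sv_inv = np.linalg.inv(Sigma_v)
w_gls = np.linalg.solve(X.T @ Sv_inv @ X, X.T @ Sv_inv @ y)

# Whitening: multiply by L^{-1}, then run plain OLS on the transformed problem
Xw, yw = np.linalg.solve(L, X), np.linalg.solve(L, y)
w_whitened = np.linalg.lstsq(Xw, yw, rcond=None)[0]

print(np.allclose(w_gls, w_whitened))            # the two routes agree
```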

Linear Minimum Mean Square Error (LMMSE) Estimation

When $\mathbf{w}$ is random with known prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_w)$, the LMMSE estimator (also the Bayesian posterior mean) is:

$$\hat{\mathbf{w}}_{\text{LMMSE}} = \Sigma_w \mathbf{X}^\top (\mathbf{X}\Sigma_w \mathbf{X}^\top + \sigma^2 \mathbf{I})^{-1} \mathbf{y}$$

This is equivalent to the MAP estimator under Gaussian assumptions and minimizes $\mathbb{E}[\|\mathbf{w} - \hat{\mathbf{w}}\|^2]$ over all linear estimators (biased or unbiased).
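The equivalence with the ridge/MAP form is a standard matrix identity; the sketch below verifies it numerically for an isotropic prior $\Sigma_w = \tau^2 \mathbf{I}$, so that $\lambda = \sigma^2/\tau^2$:

```python
import numpy as np

rng = np.random.default_rng(8)
N, d, sigma, tau = 40, 5, 0.7, 1.5
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
lam = sigma**2 / tau**2

# Measurement-space form: Sigma_w X^T (X Sigma_w X^T + sigma^2 I)^{-1} y
Sigma_w = tau**2 * np.eye(d)
w_lmmse = Sigma_w @ X.T @ np.linalg.solve(X @ Sigma_w @ X.T + sigma**2 * np.eye(N), y)

# Parameter-space (ridge/MAP) form: (X^T X + lam I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.allclose(w_lmmse, w_map))    # identical estimators
```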

Wiener Filtering and Matched Filtering

  • Wiener Filter: In the frequency domain (for stationary signals), the LMMSE estimator becomes the Wiener filter, which shapes the spectrum based on signal and noise power spectral densities.
  • Matched Filter: For detection in AWGN, the optimal filter maximizes SNR by correlating with the known signal template—geometrically, this is projection onto the signal direction, exactly the OLS perspective.

6. Sequential Estimation: RLS and the Kalman Filter

So far, our solutions are "batch" — they require all data at once. In many real-world systems (robotics, finance, online services), data arrives sequentially. We need an efficient way to update our estimate.

Recursive Least Squares (RLS)

Assumptions: The parameter $\mathbf{w}$ is static, and observations $(\boldsymbol{\phi}_k, y_k)$ arrive sequentially with measurement model $y_k = \boldsymbol{\phi}_k^\top \mathbf{w} + v_k$, where $v_k$ is zero-mean noise.

The RLS algorithm updates the estimate $\mathbf{w}_k$ and its error covariance $\mathbf{P}_k$ recursively:

Innovation (prediction error):

$$\nu_k = y_k - \boldsymbol{\phi}_k^\top \mathbf{w}_{k-1}$$

Gain:

$$\mathbf{g}_k = \frac{\mathbf{P}_{k-1}\boldsymbol{\phi}_k}{1 + \boldsymbol{\phi}_k^\top \mathbf{P}_{k-1}\boldsymbol{\phi}_k}$$

Parameter update:

$$\mathbf{w}_k = \mathbf{w}_{k-1} + \mathbf{g}_k \nu_k$$

Covariance update:

$$\mathbf{P}_k = (\mathbf{I} - \mathbf{g}_k \boldsymbol{\phi}_k^\top) \mathbf{P}_{k-1}$$

where $\mathbf{P}_k$ satisfies $\mathbf{P}_k^{-1} = \mathbf{P}_0^{-1} + \sum_{i=1}^k \boldsymbol{\phi}_i \boldsymbol{\phi}_i^\top$ (up to a noise-variance factor), with $\mathbf{P}_0$ the initial covariance. The gain $\mathbf{g}_k$ balances prior uncertainty and new information.

Forgetting factor: A variant uses $\lambda \in (0,1]$ to discount old data, improving tracking in non-stationary environments.
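A minimal RLS sketch for a static parameter and synthetic streaming data (the forgetting factor defaults to $\lambda = 1$, i.e. no discounting):

```python
import numpy as np

def rls_step(w, P, phi, y, lam=1.0):
    """One recursive least-squares update; lam is the forgetting factor."""
    nu = y - phi @ w                                   # innovation
    g = P @ phi / (lam + phi @ P @ phi)                # gain
    w = w + g * nu                                     # parameter update
    P = (P - np.outer(g, phi) @ P) / lam               # covariance update
    return w, P

rng = np.random.default_rng(9)
d = 3
w_true = np.array([1.0, -0.5, 2.0])
w, P = np.zeros(d), 1e3 * np.eye(d)                    # vague initial estimate

for k in range(500):
    phi = rng.normal(size=d)
    y = phi @ w_true + 0.1 * rng.normal()
    w, P = rls_step(w, P, phi, y)

print(w)   # converges toward w_true
```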

The Kalman Filter: A General Framework

The Kalman filter generalizes RLS to time-varying, dynamic systems. Consider a linear state-space model:

Process model:

$$\mathbf{x}_k = \mathbf{A} \mathbf{x}_{k-1} + \mathbf{w}_k, \qquad \mathbf{w}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{Q})$$

Measurement model:

$$y_k = \mathbf{H}_k \mathbf{x}_k + v_k, \qquad v_k \sim \mathcal{N}(0, R)$$

The Kalman filter provides the optimal (LMMSE) estimate of $\mathbf{x}_k$ via a predict-correct cycle:

Prediction:

$$\hat{\mathbf{x}}_{k|k-1} = \mathbf{A} \hat{\mathbf{x}}_{k-1|k-1}, \qquad \mathbf{P}_{k|k-1} = \mathbf{A} \mathbf{P}_{k-1|k-1} \mathbf{A}^\top + \mathbf{Q}$$

Innovation and its covariance:

$$\nu_k = y_k - \mathbf{H}_k \hat{\mathbf{x}}_{k|k-1}, \qquad S_k = \mathbf{H}_k \mathbf{P}_{k|k-1} \mathbf{H}_k^\top + R$$

Kalman gain:

$$\mathbf{K}_k = \mathbf{P}_{k|k-1} \mathbf{H}_k^\top S_k^{-1}$$

Correction:

$$\hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k|k-1} + \mathbf{K}_k \nu_k, \qquad \mathbf{P}_{k|k} = (\mathbf{I} - \mathbf{K}_k \mathbf{H}_k) \mathbf{P}_{k|k-1}$$

RLS as a special case: For static parameters, set $\mathbf{A} = \mathbf{I}$, $\mathbf{Q} = \mathbf{0}$, $\mathbf{H}_k = \boldsymbol{\phi}_k^\top$, and recover the RLS update exactly. The innovation $\nu_k$ is the new information not predicted by the model, and the gain $\mathbf{K}_k$ optimally weights it based on the relative uncertainties $\mathbf{P}_{k|k-1}$ and $R$.
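A compact sketch of the predict-correct cycle for a scalar measurement, tracking a toy constant-velocity state (the model matrices and noise levels are illustrative choices, not prescribed by the text):

```python
import numpy as np

def kalman_step(x, P, y, A, Q, H, R):
    """One predict-correct cycle for a scalar measurement y."""
    # Predict
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Innovation and its covariance
    nu = y - H @ x_pred
    S = H @ P_pred @ H.T + R
    # Gain and correction
    K = P_pred @ H.T / S                     # S is scalar here
    x_new = x_pred + K * nu
    P_new = (np.eye(len(x)) - np.outer(K, H)) @ P_pred
    return x_new, P_new

rng = np.random.default_rng(10)
A = np.array([[1.0, 1.0], [0.0, 1.0]])       # constant-velocity model: [position, velocity]
Q = 0.01 * np.eye(2)
H = np.array([1.0, 0.0])                     # we observe position only
R = 0.5

x_true = np.array([0.0, 1.0])
x_est, P = np.zeros(2), 10.0 * np.eye(2)
for _ in range(100):
    x_true = A @ x_true + rng.multivariate_normal(np.zeros(2), Q)
    y = H @ x_true + np.sqrt(R) * rng.normal()
    x_est, P = kalman_step(x_est, P, y, A, Q, H, R)

print(x_est, x_true)                         # the estimate tracks the true state
```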

7. The Bridge to Reinforcement Learning

This recursive, error-driven update structure forms a conceptual bridge to modern reinforcement learning. Consider the temporal-difference (TD) learning update for a value function $Q(s, a)$:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)$$

The structure mirrors the Kalman/RLS update:

$$\text{New Estimate} \leftarrow \text{Old Estimate} + \text{Gain} \times (\text{Innovation})$$

The TD error acts as the innovation signal, correcting the value estimate based on new experience. This parallel is not merely an analogy. When using linear function approximation for value functions ($Q(s,a) \approx \mathbf{w}^\top \boldsymbol{\phi}(s,a)$), many RL algorithms (e.g., LSTD, TD($\lambda$)) are forms of recursive least-squares estimation. Both fields share the same core principle: using noisy, sequential data to recursively estimate hidden quantities—whether physical states or optimal values.
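To make the structural parallel concrete, here is a tiny sketch of TD(0) with linear (one-hot) features on a made-up deterministic chain; the environment, features, and step size are all illustrative, and the update line has exactly the estimate-plus-gain-times-innovation shape:

```python
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.05
phi = np.eye(n_states)                       # one-hot features: V(s) ~ w^T phi(s)
w = np.zeros(n_states)

s = 0
for _ in range(20000):
    s_next = (s + 1) % n_states              # deterministic chain 0 -> 1 -> ... -> 4 -> 0
    r = 1.0 if s_next == 0 else 0.0          # reward on returning to state 0
    td_error = r + gamma * w @ phi[s_next] - w @ phi[s]   # innovation
    w = w + alpha * td_error * phi[s]                     # estimate += gain * innovation
    s = s_next

print(w)   # approximates the discounted return from each state
```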

8. Conclusion

Linear regression is far more than a simple curve-fitting tool. It is a microcosm of the core principles of estimation, learning, and information processing. By viewing it through multiple lenses, we uncover a profound web of connections:

  • Geometry reveals the solution as an orthogonal projection, with the SVD providing insight into ill-conditioning and the pseudoinverse.
  • Probability justifies it as the maximum likelihood estimator under Gaussian noise and proves it is BLUE via the Gauss-Markov theorem.
  • Bayesian Inference recasts regularization as encoding prior beliefs, with the bias-variance tradeoff governing generalization.
  • Information Theory connects Fisher information, the Cramér-Rao bound, and mutual information in linear Gaussian channels, revealing fundamental limits on estimation.
  • Signal Estimation extends the framework to GLS (correlated noise), LMMSE (random parameters), and Wiener/matched filtering.
  • Sequential Estimation transforms batch solutions into recursive algorithms (RLS, Kalman filter) that operate in real-time with precise covariance updates.
  • Reinforcement Learning shares the same recursive, innovation-driven structure, unifying value estimation with state estimation.

This journey reveals a unifying theme: learning is a process of updating beliefs from data under uncertainty. Linear regression, in its elegance and depth, is our first and most fundamental guide to this landscape—a bridge connecting geometry, probability, information theory, signal processing, control, and modern machine learning.

References and Further Reading

Linear Models and Estimation:

  • Seber, G. A. F., & Lee, A. J. (2003). Linear Regression Analysis (2nd ed.). Wiley.
  • Björck, Å. (1996). Numerical Methods for Least Squares Problems. SIAM.
  • Kay, S. M. (1993). Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory. Prentice Hall.

Bayesian Methods and Regularization:

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  • Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

Kalman Filtering and Recursive Estimation:

  • Simon, D. (2006). Optimal State Estimation. Wiley.
  • Grewal, M. S., & Andrews, A. P. (2014). Kalman Filtering: Theory and Practice Using MATLAB (4th ed.). Wiley.

Information Theory:

  • Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.

Diagnostics and Computational Methods:

  • Cook, R. D., & Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall.