Why Linear Regression Is More Profound Than You Think: A Journey Through Estimation Theory

Mohammadmahdi Maharebi
linear algebra · estimation theory · regression · pseudoinverse · kalman filter · machine learning

Linear regression is often the first algorithm encountered in machine learning, prized for its simplicity and interpretability. But this apparent simplicity is deceptive. Beneath the surface, linear regression serves as a gateway to a rich, interconnected world of geometric, probabilistic, and dynamic systems concepts.

This post revisits the classic least-squares problem to uncover the deeper mathematical structures at its core. We will journey through four distinct but convergent perspectives:

  1. Geometric Intuition: Viewing regression as a projection in vector spaces.
  2. Probabilistic Rigor: Framing it as a maximum likelihood estimation problem and invoking the Gauss-Markov theorem.
  3. Bayesian Inference: Understanding regularization as the incorporation of prior beliefs.
  4. Sequential Estimation: Evolving the batch solution into a recursive one suitable for real-time systems, leading us to the Kalman Filter and its surprising links to reinforcement learning.

1. The Geometric View: Regression as Projection

The most elegant and fundamental perspective on linear regression is geometric. We are given a set of input vectors $\boldsymbol{\phi}(x_i) \in \mathbb{R}^d$ and corresponding scalar outputs $y_i$. Our goal is to find a linear combination of the features that best approximates the outputs.

Let's assemble our data:

  • The design matrix $\mathbf{X} \in \mathbb{R}^{N \times d}$, whose rows are the feature vectors $\boldsymbol{\phi}(x_i)^\top$.
  • The target vector $\mathbf{y} \in \mathbb{R}^N$.
  • The parameter vector $\mathbf{w} \in \mathbb{R}^d$.

Our model predicts $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}$. The vector $\hat{\mathbf{y}}$ is, by definition, a linear combination of the columns of $\mathbf{X}$. This means $\hat{\mathbf{y}}$ must lie in the column space of $\mathbf{X}$, denoted $\mathcal{C}(\mathbf{X})$.

The ordinary least squares (OLS) problem seeks to find the parameter vector w\mathbf{w}^* that minimizes the squared Euclidean distance between the prediction and the true targets:

$$\mathbf{w}^* = \underset{\mathbf{w}}{\arg\min}\; \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$$

Geometrically, this is equivalent to finding the vector $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}^*$ in the column space $\mathcal{C}(\mathbf{X})$ that is closest to $\mathbf{y}$. This vector is the orthogonal projection of $\mathbf{y}$ onto $\mathcal{C}(\mathbf{X})$.

This geometric condition implies that the residual vector, $\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{w}^*$, must be orthogonal to every vector in $\mathcal{C}(\mathbf{X})$. This can be stated succinctly as:

$$\mathbf{X}^\top (\mathbf{y} - \mathbf{X}\mathbf{w}^*) = \mathbf{0}$$

Rearranging this gives the celebrated normal equations:

$$\mathbf{X}^\top \mathbf{X} \mathbf{w}^* = \mathbf{X}^\top \mathbf{y}$$

If the columns of $\mathbf{X}$ are linearly independent, the Gram matrix $\mathbf{X}^\top \mathbf{X}$ is invertible, and we obtain the unique solution:

$$\boxed{\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}}$$
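As a quick numerical sanity check (a minimal sketch with synthetic data, not part of the derivation), the following NumPy snippet solves the normal equations and verifies that the residual is orthogonal to the columns of $\mathbf{X}$; in practice `np.linalg.lstsq` (QR/SVD-based) is preferred over forming $\mathbf{X}^\top \mathbf{X}$ explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))                     # design matrix (synthetic data)
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)       # targets with additive noise

# Solve the normal equations X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# The residual must be orthogonal to the column space of X
residual = y - X @ w_star
print(np.allclose(X.T @ residual, 0.0, atol=1e-8))   # True (up to round-off)

# Numerically preferable: least squares via SVD/QR
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_star, w_lstsq))
```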

The Pseudoinverse and SVD

When $\mathbf{X}^\top \mathbf{X}$ is singular or ill-conditioned, we use the Moore-Penrose pseudoinverse $\mathbf{X}^+$. Via the singular value decomposition (SVD), write:

$$\mathbf{X} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top$$

where $\mathbf{U} \in \mathbb{R}^{N \times N}$ and $\mathbf{V} \in \mathbb{R}^{d \times d}$ are orthogonal, and $\mathbf{\Sigma} \in \mathbb{R}^{N \times d}$ contains the singular values $\sigma_i \geq 0$. The pseudoinverse is:

$$\mathbf{X}^+ = \mathbf{V} \mathbf{\Sigma}^+ \mathbf{U}^\top$$

where $\mathbf{\Sigma}^+$ replaces each nonzero $\sigma_i$ with $1/\sigma_i$ (and transposes the shape). The solution $\mathbf{w}^* = \mathbf{X}^+ \mathbf{y}$ exists for any $\mathbf{X}$ and gives the minimum-norm least-squares solution when the system is underdetermined.

Conditioning: The condition number $\kappa(\mathbf{X}^\top \mathbf{X}) = (\sigma_{\max}/\sigma_{\min})^2$ measures sensitivity to perturbations. Large $\kappa$ indicates ill-conditioning, amplifying noise and motivating regularization.
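A brief sketch of the SVD route on a synthetic, nearly rank-deficient design; `np.linalg.pinv` performs essentially the same construction, zeroing out singular values below a tolerance.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 4
X = rng.normal(size=(N, d))
X[:, 3] = X[:, 0] + 1e-9 * rng.normal(size=N)     # nearly dependent column -> ill-conditioned

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD: U (N x d), s (d,), Vt (d x d)
tol = max(X.shape) * np.finfo(float).eps * s.max()
s_inv = np.where(s > tol, 1.0 / s, 0.0)           # invert only the significant singular values

X_pinv = Vt.T @ np.diag(s_inv) @ U.T              # Moore-Penrose pseudoinverse
print(np.allclose(X_pinv, np.linalg.pinv(X)))     # matches NumPy's built-in (same construction)
print("cond(X^T X) ~", (s.max() / s.min()) ** 2)  # huge: the design is ill-conditioned
```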

The Hat Matrix and Leverage

The hat matrix (or projection matrix) is:

$$\mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top$$

It projects $\mathbf{y}$ onto $\mathcal{C}(\mathbf{X})$: $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$. The diagonal elements $h_{ii}$ are the leverage values, quantifying the influence of observation $i$ on its own prediction. High leverage indicates potential outliers or influential points. The residual $\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y}$ is orthogonal to $\mathcal{C}(\mathbf{X})$, as required.
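A short illustration of the hat matrix and leverage on made-up data (the variable names and numbers are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 30, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # intercept + one feature
X[0, 1] = 8.0                                          # an extreme x-value -> high leverage
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix (solve instead of an explicit inverse)
leverage = np.diag(H)

print(np.isclose(H.trace(), d))         # trace(H) = rank(X) = d
print(leverage.argmax() == 0)           # the extreme point has the largest leverage
print(np.allclose(H @ H, H))            # H is idempotent: a projection
```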

2. The Probabilistic View: Maximum Likelihood & The Gauss-Markov Theorem

The geometric view is deterministic. To introduce statistical properties, we model the data-generating process. A standard assumption is that the outputs are generated by a linear model with additive, zero-mean Gaussian noise:

$$y_i = \mathbf{w}^\top \boldsymbol{\phi}(x_i) + v_i, \qquad v_i \sim \mathcal{N}(0, \sigma^2)$$

In vector form, this is $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{v}$, where $\mathbf{v} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$.

From this, we can ask: what parameter vector $\mathbf{w}$ most likely generated the observed data $\mathbf{y}$? The likelihood function gives us the probability density of the data given the parameters:

$$p(\mathbf{y} \mid \mathbf{w}) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mathbf{w}^\top \boldsymbol{\phi}(x_i))^2}{2\sigma^2}\right) \propto \exp\left(-\frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2\right)$$

To find the Maximum Likelihood Estimator (MLE), we typically maximize the log-likelihood, which is equivalent and mathematically simpler:

$$\log p(\mathbf{y} \mid \mathbf{w}) = C - \frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$$

Maximizing this expression is equivalent to minimizing $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$. The solution is identical to the OLS estimator:

$$\boxed{\hat{\mathbf{w}}_{\mathrm{ML}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}}$$

This reveals that OLS is not merely a convenient heuristic; it is the statistically optimal estimator under the Gaussian noise assumption. This allows us to analyze its properties:

  • Unbiasedness: The estimator is correct on average: $\mathbb{E}[\hat{\mathbf{w}}_{\mathrm{ML}}] = \mathbf{w}$.
  • Covariance: The uncertainty in our estimate is $\mathrm{Cov}(\hat{\mathbf{w}}_{\mathrm{ML}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$.
  • Efficiency: The estimator achieves the Cramér-Rao Lower Bound, meaning it is the most precise unbiased estimator possible.

Crucially, the Gauss-Markov Theorem provides a weaker but more general guarantee: even if the noise is not Gaussian, as long as it is uncorrelated and has zero mean and constant variance (homoscedastic), the OLS estimator is the Best Linear Unbiased Estimator (BLUE). It has the minimum variance among all linear unbiased estimators.
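These properties are easy to check numerically. The sketch below (synthetic data, illustrative settings) repeats the noise draw many times and compares the empirical mean and covariance of $\hat{\mathbf{w}}_{\mathrm{ML}}$ against the theoretical values above.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, sigma = 200, 3, 0.5
X = rng.normal(size=(N, d))                 # fixed design across trials
w_true = np.array([1.0, -2.0, 0.5])
cov_theory = sigma**2 * np.linalg.inv(X.T @ X)

estimates = []
for _ in range(5000):                       # Monte Carlo over noise realizations
    y = X @ w_true + sigma * rng.normal(size=N)
    estimates.append(np.linalg.lstsq(X, y, rcond=None)[0])
estimates = np.array(estimates)

print("empirical bias:", estimates.mean(axis=0) - w_true)     # ~ 0 (unbiased)
print("max covariance error:",
      np.abs(np.cov(estimates.T) - cov_theory).max())         # small: matches sigma^2 (X^T X)^{-1}
```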

Gauss-Markov Theorem: A Sketch

Setup: Consider any linear unbiased estimator of the form $\tilde{\mathbf{w}} = \mathbf{A}\mathbf{y}$ where $\mathbf{A} \in \mathbb{R}^{d \times N}$. For unbiasedness:

$$\mathbb{E}[\tilde{\mathbf{w}}] = \mathbf{A}\,\mathbb{E}[\mathbf{y}] = \mathbf{A}\mathbf{X}\mathbf{w} = \mathbf{w} \quad \forall\, \mathbf{w}$$

This requires $\mathbf{A}\mathbf{X} = \mathbf{I}$.

Variance: The covariance of $\tilde{\mathbf{w}}$ is:

$$\mathrm{Cov}(\tilde{\mathbf{w}}) = \mathbf{A}\, \mathrm{Cov}(\mathbf{y})\, \mathbf{A}^\top = \sigma^2 \mathbf{A}\mathbf{A}^\top$$

Optimality: Among all matrices $\mathbf{A}$ satisfying $\mathbf{A}\mathbf{X} = \mathbf{I}$, the choice $\mathbf{A}^* = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top$ minimizes the variance (in the sense of the Loewner ordering). This gives:

$$\mathrm{Cov}(\hat{\mathbf{w}}_{\mathrm{OLS}}) = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$$

Thus, OLS is BLUE.

3. The Bayesian View: From Regularization to Priors

The MLE perspective assumes we know nothing about $\mathbf{w}$ beforehand. Bayesian inference allows us to incorporate prior beliefs. We treat $\mathbf{w}$ as a random variable with a prior distribution $p(\mathbf{w})$.

Let's assume a zero-mean Gaussian prior for $\mathbf{w}$, which encodes a belief that smaller parameter values are more likely:

$$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$$

Using Bayes' rule, the posterior distribution of $\mathbf{w}$ after observing the data is:

$$p(\mathbf{w} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y})} \propto p(\mathbf{y} \mid \mathbf{w})\, p(\mathbf{w})$$

The Maximum A Posteriori (MAP) estimate maximizes this posterior probability. Taking the negative log of the posterior gives:

$$-\log p(\mathbf{w} \mid \mathbf{y}) = \frac{1}{2\sigma^2} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{1}{2} \mathbf{w}^\top \Sigma_p^{-1} \mathbf{w} + \text{const}$$

If we assume a simple spherical prior $\Sigma_p = \tau^2 \mathbf{I}$, the MAP estimate becomes the minimizer of:

$$L(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{\sigma^2}{\tau^2} \|\mathbf{w}\|^2$$

The solution to this is:

$$\boxed{\hat{\mathbf{w}}_{\mathrm{MAP}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}} \quad \text{where} \quad \lambda = \frac{\sigma^2}{\tau^2}$$

This is exactly Ridge Regression. The regularization term, often seen as an ad-hoc penalty to prevent overfitting, is now revealed to be the consequence of a Gaussian prior on the parameters.
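A minimal sketch of the closed-form ridge/MAP estimator on synthetic data, showing the shrinkage effect of $\lambda$ (all numbers are arbitrary choices for illustration):

```python
import numpy as np

def ridge(X, y, lam):
    """MAP / ridge estimate: (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 5))
y = X @ rng.normal(size=5) + 0.3 * rng.normal(size=40)

for lam in (0.0, 1.0, 10.0):
    w = ridge(X, y, lam)
    print(f"lambda={lam:5.1f}  ||w||={np.linalg.norm(w):.3f}")   # the norm shrinks as lambda grows
```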

Posterior Covariance: The full posterior distribution is also Gaussian, with covariance:

$$\Sigma_{\text{post}} = \left(\frac{1}{\sigma^2}\mathbf{X}^\top \mathbf{X} + \Sigma_p^{-1}\right)^{-1}$$

For the isotropic case, this becomes:

$$\Sigma_{\text{post}} = \sigma^2 (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1}$$

Notice that regularization reduces posterior uncertainty (smaller eigenvalues) at the cost of introducing bias.

Bias-Variance Decomposition

For a new test point $\mathbf{x}$, the prediction $\hat{y} = \mathbf{x}^\top \hat{\mathbf{w}}$ has an expected squared error that decomposes into three components:

$$\mathbb{E}[(y - \hat{y})^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$$

where:

  • Bias: $(\mathbf{x}^\top \mathbf{w} - \mathbb{E}[\mathbf{x}^\top \hat{\mathbf{w}}])^2$ measures systematic error.
  • Variance: $\mathbb{E}[(\mathbf{x}^\top \hat{\mathbf{w}} - \mathbb{E}[\mathbf{x}^\top \hat{\mathbf{w}}])^2]$ measures sensitivity to the training data.
  • Irreducible error: $\sigma^2$ from the noise.

For OLS, bias is zero but variance can be large (especially when $\kappa(\mathbf{X}^\top \mathbf{X})$ is large). Ridge regression introduces bias by shrinking coefficients, but reduces variance, often improving overall test error—the fundamental bias-variance tradeoff.
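A small, illustrative simulation of this tradeoff on a deliberately ill-conditioned synthetic design (the specific numbers are arbitrary; the point is the qualitative pattern of test error versus $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, sigma = 30, 10, 1.0
scales = np.diag(np.logspace(0, -3, d))          # rapidly decaying column scales -> ill-conditioned
X = rng.normal(size=(N, d)) @ scales
w_true = rng.normal(size=d)
X_test = rng.normal(size=(1000, d)) @ scales
y_test_clean = X_test @ w_true                   # noiseless targets: error = bias^2 + variance

def fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in (0.0, 0.1, 1.0):
    errs = []
    for _ in range(500):                         # average over training-noise realizations
        y = X @ w_true + sigma * rng.normal(size=N)
        errs.append(np.mean((X_test @ fit(X, y, lam) - y_test_clean) ** 2))
    print(f"lambda={lam:4.1f}  mean test MSE={np.mean(errs):.3f}")
```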

Other priors lead to different regularizers. For instance, a Laplacian prior $p(\mathbf{w}) \propto \exp(-\alpha \|\mathbf{w}\|_1)$ results in an $L_1$ penalty, leading to LASSO regression, which encourages sparse solutions.

4. The Information-Theoretic View

Linear regression connects deeply to information theory through the Fisher information matrix and mutual information in linear Gaussian channels.

Fisher Information and the Cramér-Rao Bound

The Fisher information matrix quantifies how much information the data $\mathbf{y}$ carries about the parameters $\mathbf{w}$. For the linear Gaussian model $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{v}$ with $\mathbf{v} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$, it is:

$$\mathcal{I}(\mathbf{w}) = \frac{1}{\sigma^2} \mathbf{X}^\top \mathbf{X}$$

The Cramér-Rao Lower Bound (CRLB) states that for any unbiased estimator $\hat{\mathbf{w}}$:

$$\mathrm{Cov}(\hat{\mathbf{w}}) \succeq \mathcal{I}(\mathbf{w})^{-1} = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}$$

The MLE/OLS estimator achieves this bound exactly, making it efficient—no unbiased estimator can do better.

Mutual Information in Linear Gaussian Channels

Consider $\mathbf{w}$ as a random signal with prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_w)$, and the observation model $\mathbf{y} = \mathbf{X}\mathbf{w} + \mathbf{v}$. The mutual information between $\mathbf{w}$ and $\mathbf{y}$ is:

$$I(\mathbf{w}; \mathbf{y}) = \frac{1}{2}\log\det\left(\mathbf{I} + \frac{1}{\sigma^2}\mathbf{X}\Sigma_w \mathbf{X}^\top\right)$$

This quantifies how much observing $\mathbf{y}$ reduces uncertainty about $\mathbf{w}$. Via the SVD of $\mathbf{X}$, this depends on the singular values $\sigma_i$ and the prior variance along each singular direction. Regularization (small $\Sigma_w$) reduces mutual information but improves numerical stability and generalization.
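A compact numerical check of this formula for an isotropic prior $\Sigma_w = \tau^2 \mathbf{I}$, where the log-determinant reduces to a sum over the singular values of $\mathbf{X}$ (all quantities below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, sigma, tau = 50, 4, 1.0, 2.0
X = rng.normal(size=(N, d))

# Mutual information via the log-determinant (natural log -> nats)
M = np.eye(N) + (tau**2 / sigma**2) * (X @ X.T)
mi_logdet = 0.5 * np.linalg.slogdet(M)[1]

# Equivalent expression through the singular values of X
s = np.linalg.svd(X, compute_uv=False)
mi_svd = 0.5 * np.sum(np.log1p((tau**2 / sigma**2) * s**2))

print(mi_logdet, mi_svd)        # the two expressions agree
```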

5. Signal Estimation Theory Connections

Linear regression is a special case of broader signal estimation frameworks.

Generalized Least Squares (GLS)

When the noise is correlated with covariance $\mathrm{Cov}(\mathbf{v}) = \mathbf{\Sigma}_v \neq \sigma^2 \mathbf{I}$, OLS is no longer BLUE. The Generalized Least Squares (GLS) estimator is:

$$\hat{\mathbf{w}}_{\text{GLS}} = (\mathbf{X}^\top \mathbf{\Sigma}_v^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{\Sigma}_v^{-1} \mathbf{y}$$

GLS is BLUE for correlated noise, transforming the problem via $\mathbf{\Sigma}_v^{-1/2}$ to recover the OLS structure.
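A sketch of this whitening argument using a Cholesky factor $\mathbf{\Sigma}_v = \mathbf{L}\mathbf{L}^\top$ and a synthetic AR(1)-style noise covariance (the setup is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(7)
N, d, rho = 60, 3, 0.8
X = rng.normal(size=(N, d))
w_true = np.array([1.0, 0.5, -1.0])

# AR(1)-style noise covariance: Sigma_v[i, j] = rho^|i - j|
idx = np.arange(N)
Sigma_v = rho ** np.abs(idx[:, None] - idx[None, :])
L = np.linalg.cholesky(Sigma_v)
y = X @ w_true + L @ rng.normal(size=N)          # correlated noise sample

# Direct GLS formula
Sv_inv = np.linalg.inv(Sigma_v)
w_gls = np.linalg.solve(X.T @ Sv_inv @ X, X.T @ Sv_inv @ y)

# Whitening: multiply by L^{-1}, then run plain OLS on the transformed problem
Xw, yw = np.linalg.solve(L, X), np.linalg.solve(L, y)
w_whitened = np.linalg.lstsq(Xw, yw, rcond=None)[0]

print(np.allclose(w_gls, w_whitened))            # the two routes agree
```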

Linear Minimum Mean Square Error (LMMSE) Estimation

When $\mathbf{w}$ is random with known prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_w)$, the LMMSE estimator (also the Bayesian posterior mean) is:

$$\hat{\mathbf{w}}_{\text{LMMSE}} = \Sigma_w \mathbf{X}^\top (\mathbf{X}\Sigma_w \mathbf{X}^\top + \sigma^2 \mathbf{I})^{-1} \mathbf{y}$$

This is equivalent to the MAP estimator under Gaussian assumptions and minimizes $\mathbb{E}[\|\mathbf{w} - \hat{\mathbf{w}}\|^2]$ over all linear estimators (biased or unbiased).
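The equivalence with the ridge/MAP form is a standard matrix identity; the sketch below verifies it numerically for an isotropic prior $\Sigma_w = \tau^2 \mathbf{I}$, so that $\lambda = \sigma^2/\tau^2$:

```python
import numpy as np

rng = np.random.default_rng(8)
N, d, sigma, tau = 40, 5, 0.7, 1.5
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
lam = sigma**2 / tau**2

# Measurement-space form: Sigma_w X^T (X Sigma_w X^T + sigma^2 I)^{-1} y
Sigma_w = tau**2 * np.eye(d)
w_lmmse = Sigma_w @ X.T @ np.linalg.solve(X @ Sigma_w @ X.T + sigma**2 * np.eye(N), y)

# Parameter-space (ridge/MAP) form: (X^T X + lam I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.allclose(w_lmmse, w_map))    # identical estimators
```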

Wiener Filtering and Matched Filtering

  • Wiener Filter: In the frequency domain (for stationary signals), the LMMSE estimator becomes the Wiener filter, which shapes the spectrum based on signal and noise power spectral densities.
  • Matched Filter: For detection in AWGN, the optimal filter maximizes SNR by correlating with the known signal template—geometrically, this is projection onto the signal direction, exactly the OLS perspective.

6. Sequential Estimation: RLS and the Kalman Filter

So far, our solutions are "batch" — they require all data at once. In many real-world systems (robotics, finance, online services), data arrives sequentially. We need an efficient way to update our estimate.

Recursive Least Squares (RLS)

Assumptions: The parameter $\mathbf{w}$ is static, and observations $(\boldsymbol{\phi}_k, y_k)$ arrive sequentially with measurement model $y_k = \boldsymbol{\phi}_k^\top \mathbf{w} + v_k$, where $v_k$ is zero-mean noise.

The RLS algorithm updates the estimate $\mathbf{w}_k$ and its error covariance $\mathbf{P}_k$ recursively:

Innovation (prediction error):

$$\nu_k = y_k - \boldsymbol{\phi}_k^\top \mathbf{w}_{k-1}$$

Gain:

$$\mathbf{g}_k = \frac{\mathbf{P}_{k-1}\boldsymbol{\phi}_k}{1 + \boldsymbol{\phi}_k^\top \mathbf{P}_{k-1}\boldsymbol{\phi}_k}$$

Parameter update:

$$\mathbf{w}_k = \mathbf{w}_{k-1} + \mathbf{g}_k \nu_k$$

Covariance update:

$$\mathbf{P}_k = (\mathbf{I} - \mathbf{g}_k \boldsymbol{\phi}_k^\top) \mathbf{P}_{k-1}$$

where $\mathbf{P}_k$ satisfies $\mathbf{P}_k^{-1} = \mathbf{P}_0^{-1} + \sum_{i=1}^k \boldsymbol{\phi}_i \boldsymbol{\phi}_i^\top$ (up to a noise-variance factor), with $\mathbf{P}_0$ the initial covariance. The gain $\mathbf{g}_k$ balances prior uncertainty and new information.

Forgetting factor: A variant uses $\lambda \in (0,1]$ to discount old data, improving tracking in non-stationary environments.
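A minimal RLS sketch for a static parameter and synthetic streaming data (the forgetting factor defaults to $\lambda = 1$, i.e. no discounting):

```python
import numpy as np

def rls_step(w, P, phi, y, lam=1.0):
    """One recursive least-squares update; lam is the forgetting factor."""
    nu = y - phi @ w                                   # innovation
    g = P @ phi / (lam + phi @ P @ phi)                # gain
    w = w + g * nu                                     # parameter update
    P = (P - np.outer(g, phi) @ P) / lam               # covariance update
    return w, P

rng = np.random.default_rng(9)
d = 3
w_true = np.array([1.0, -0.5, 2.0])
w, P = np.zeros(d), 1e3 * np.eye(d)                    # vague initial estimate

for k in range(500):
    phi = rng.normal(size=d)
    y = phi @ w_true + 0.1 * rng.normal()
    w, P = rls_step(w, P, phi, y)

print(w)   # converges toward w_true
```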

The Kalman Filter: A General Framework

The Kalman filter generalizes RLS to time-varying, dynamic systems. Consider a linear state-space model:

Process model:

$$\mathbf{x}_k = \mathbf{A} \mathbf{x}_{k-1} + \mathbf{w}_k, \qquad \mathbf{w}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{Q})$$

Measurement model:

$$y_k = \mathbf{H}_k \mathbf{x}_k + v_k, \qquad v_k \sim \mathcal{N}(0, R)$$

The Kalman filter provides the optimal (LMMSE) estimate of $\mathbf{x}_k$ via a predict-correct cycle:

Prediction:

$$\hat{\mathbf{x}}_{k|k-1} = \mathbf{A} \hat{\mathbf{x}}_{k-1|k-1}, \qquad \mathbf{P}_{k|k-1} = \mathbf{A} \mathbf{P}_{k-1|k-1} \mathbf{A}^\top + \mathbf{Q}$$

Innovation and its covariance:

$$\nu_k = y_k - \mathbf{H}_k \hat{\mathbf{x}}_{k|k-1}, \qquad S_k = \mathbf{H}_k \mathbf{P}_{k|k-1} \mathbf{H}_k^\top + R$$

Kalman gain:

$$\mathbf{K}_k = \mathbf{P}_{k|k-1} \mathbf{H}_k^\top S_k^{-1}$$

Correction:

$$\hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k|k-1} + \mathbf{K}_k \nu_k, \qquad \mathbf{P}_{k|k} = (\mathbf{I} - \mathbf{K}_k \mathbf{H}_k) \mathbf{P}_{k|k-1}$$

RLS as a special case: For static parameters, set $\mathbf{A} = \mathbf{I}$, $\mathbf{Q} = \mathbf{0}$, $\mathbf{H}_k = \boldsymbol{\phi}_k^\top$, and recover the RLS update exactly. The innovation $\nu_k$ is the new information not predicted by the model, and the gain $\mathbf{K}_k$ optimally weights it based on the relative uncertainties $\mathbf{P}_{k|k-1}$ and $R$.
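A compact sketch of the predict-correct cycle for a scalar measurement, tracking a toy constant-velocity state (the model matrices and noise levels are illustrative choices, not prescribed by the text):

```python
import numpy as np

def kalman_step(x, P, y, A, Q, H, R):
    """One predict-correct cycle for a scalar measurement y."""
    # Predict
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Innovation and its covariance
    nu = y - H @ x_pred
    S = H @ P_pred @ H.T + R
    # Gain and correction
    K = P_pred @ H.T / S                     # S is scalar here
    x_new = x_pred + K * nu
    P_new = (np.eye(len(x)) - np.outer(K, H)) @ P_pred
    return x_new, P_new

rng = np.random.default_rng(10)
A = np.array([[1.0, 1.0], [0.0, 1.0]])       # constant-velocity model: [position, velocity]
Q = 0.01 * np.eye(2)
H = np.array([1.0, 0.0])                     # we observe position only
R = 0.5

x_true = np.array([0.0, 1.0])
x_est, P = np.zeros(2), 10.0 * np.eye(2)
for _ in range(100):
    x_true = A @ x_true + rng.multivariate_normal(np.zeros(2), Q)
    y = H @ x_true + np.sqrt(R) * rng.normal()
    x_est, P = kalman_step(x_est, P, y, A, Q, H, R)

print(x_est, x_true)                         # the estimate tracks the true state
```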

7. The Bridge to Reinforcement Learning

This recursive, error-driven update structure forms a conceptual bridge to modern reinforcement learning. Consider the temporal-difference (TD) learning update for a value function $Q(s, a)$:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)$$

The structure mirrors the Kalman/RLS update:

$$\text{New Estimate} \leftarrow \text{Old Estimate} + \text{Gain} \times (\text{Innovation})$$

The TD error acts as the innovation signal, correcting the value estimate based on new experience. This parallel is not merely an analogy. When using linear function approximation for value functions ($Q(s,a) \approx \mathbf{w}^\top \boldsymbol{\phi}(s,a)$), many RL algorithms (e.g., LSTD, TD($\lambda$)) are forms of recursive least-squares estimation. Both fields share the same core principle: using noisy, sequential data to recursively estimate hidden quantities—whether physical states or optimal values.
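To make the structural parallel concrete, here is a tiny sketch of TD(0) with linear (one-hot) features on a made-up deterministic chain; the environment, features, and step size are all illustrative, and the update line has exactly the estimate-plus-gain-times-innovation shape:

```python
import numpy as np

n_states, gamma, alpha = 5, 0.9, 0.05
phi = np.eye(n_states)                       # one-hot features: V(s) ~ w^T phi(s)
w = np.zeros(n_states)

s = 0
for _ in range(20000):
    s_next = (s + 1) % n_states              # deterministic chain 0 -> 1 -> ... -> 4 -> 0
    r = 1.0 if s_next == 0 else 0.0          # reward on returning to state 0
    td_error = r + gamma * w @ phi[s_next] - w @ phi[s]   # innovation
    w = w + alpha * td_error * phi[s]                     # estimate += gain * innovation
    s = s_next

print(w)   # approximates the discounted return from each state
```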

8. Conclusion

Linear regression is far more than a simple curve-fitting tool. It is a microcosm of the core principles of estimation, learning, and information processing. By viewing it through multiple lenses, we uncover a profound web of connections:

  • Geometry reveals the solution as an orthogonal projection, with the SVD providing insight into ill-conditioning and the pseudoinverse.
  • Probability justifies it as the maximum likelihood estimator under Gaussian noise and proves it is BLUE via the Gauss-Markov theorem.
  • Bayesian Inference recasts regularization as encoding prior beliefs, with the bias-variance tradeoff governing generalization.
  • Information Theory connects Fisher information, the Cramér-Rao bound, and mutual information in linear Gaussian channels, revealing fundamental limits on estimation.
  • Signal Estimation extends the framework to GLS (correlated noise), LMMSE (random parameters), and Wiener/matched filtering.
  • Sequential Estimation transforms batch solutions into recursive algorithms (RLS, Kalman filter) that operate in real-time with precise covariance updates.
  • Reinforcement Learning shares the same recursive, innovation-driven structure, unifying value estimation with state estimation.

This journey reveals a unifying theme: learning is a process of updating beliefs from data under uncertainty. Linear regression, in its elegance and depth, is our first and most fundamental guide to this landscape—a bridge connecting geometry, probability, information theory, signal processing, control, and modern machine learning.

References and Further Reading

Linear Models and Estimation:

  • Seber, G. A. F., & Lee, A. J. (2003). Linear Regression Analysis (2nd ed.). Wiley.
  • Björck, Å. (1996). Numerical Methods for Least Squares Problems. SIAM.
  • Kay, S. M. (1993). Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory. Prentice Hall.

Bayesian Methods and Regularization:

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  • Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

Kalman Filtering and Recursive Estimation:

  • Simon, D. (2006). Optimal State Estimation. Wiley.
  • Grewal, M. S., & Andrews, A. P. (2014). Kalman Filtering: Theory and Practice Using MATLAB (4th ed.). Wiley.

Information Theory:

  • Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.

Diagnostics and Computational Methods:

  • Cook, R. D., & Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall.