Linear Regression: Ordinary Least Squares
TL;DR
- OLS estimator is unbiased if:
  - No Collinearity
  - \(\mathrm{E} \left[ \epsilon_i \right] = 0, \forall i \in \left[ 1, n \right]\)
- OLS estimator is the Best Linear Unbiased Estimator (BLUE) if:
  - No Collinearity
  - \(\mathrm{E} \left[ \epsilon_i \right] = 0, \forall i \in \left[ 1, n \right]\)
  - \(\mathrm{Var} \left[ \epsilon_i \right] = \sigma^2, \forall i \in \left[ 1, n \right]\)
  - \(\mathrm{Cov} \left[ \epsilon_i, \epsilon_j \right] = 0, \forall i \ne j\)
- OLS estimator is equivalent to MLE (generalized linear model) if:
  - No Collinearity
  - \(\mathrm{E} \left[ \epsilon_i \right] = 0, \forall i \in \left[ 1, n \right]\)
  - \(\mathrm{Var} \left[ \epsilon_i \right] = \sigma^2, \forall i \in \left[ 1, n \right]\)
  - \(\mathrm{Cov} \left[ \epsilon_i, \epsilon_j \right] = 0, \forall i \ne j\)
  - Random errors are independently and identically drawn from a normal distribution.
Hypothesis
From the previous derivation we conclude that the linear
predictor \(f : \mathbb{R}^p \to \mathbb{R}\) is a linear functional:
\[
f (x) = x^\top \beta
\]
where \(x \in \mathbb{R}^p\) is the input vector and \(\beta \in \mathbb{R}^p\) is the
vector of linear coefficients.
We suppose that there is an underlying non-random, unobservable coefficient
vector \(\beta^{\ast} \in \mathbb{R}^p\) such that for every data point
\((x, y) \in \mathbb{R}^p \times \mathbb{R}\):
\[
y = x^\top \beta^{\ast} + \varepsilon
\]
where \(\varepsilon\) is a random error term, which is assumed to be independent
of the input vector \(x\).
Also from the previous chapter, the Ordinary Least Squares
(OLS) solution can be expressed as:
\[
\beta_{\text{ols}} = (X^\top X)^{-1} X^\top y
\]
provided that no collinearity holds, i.e. \(X^\top X\) is invertible.
Substituting \(y = X \beta^{\ast} + \epsilon\):
\[\begin{split}
\therefore
\beta_{\text{ols}}
&= (X^\top X)^{-1} X^\top (X \beta^{\ast} + \epsilon) \\
&= (X^\top X)^{-1} X^\top X \beta^{\ast} +
(X^\top X)^{-1} X^\top \epsilon \\
&= \beta^{\ast} + (X^\top X)^{-1} X^\top \epsilon
\end{split}\]
where \(\epsilon\) is the vector of random errors
\(\varepsilon_1, \ldots, \varepsilon_n\).
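As a quick numerical illustration, the closed-form solution can be computed directly with NumPy. This is a minimal sketch; the sample size, coefficients, and noise level below are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 0.5          # assumed simulation settings

X = rng.normal(size=(n, p))               # random design; full column rank
beta_star = np.array([1.0, -2.0, 0.5])    # assumed "true" coefficients
y = X @ beta_star + rng.normal(scale=sigma, size=n)

# beta_ols = (X'X)^{-1} X'y; np.linalg.solve avoids forming the inverse
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_ols)                           # close to beta_star
```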
Zero Mean Error and Unbiased Estimator
\[
\mathrm{E} \left[ \epsilon \right] = 0
\implies
\mathrm{E} \left[ \beta_{\text{ols}} \right] = \beta^{\ast}
\]
Proof:
\[\begin{split}
\mathrm{E} \left[
\beta_{\text{ols}}
\right]
&= \mathrm{E} \left[
\beta^{\ast} + (X^\top X)^{-1} X^\top \epsilon
\right] \\
&= \beta^{\ast} + (X^\top X)^{-1} X^\top \mathrm{E} \left[ \epsilon \right] \\
&= \beta^{\ast}
\end{split}\]
\[\tag*{$\blacksquare$}\]
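The unbiasedness claim can also be checked empirically: holding the design \(X\) fixed and redrawing zero-mean errors many times, the average of \(\beta_{\text{ols}}\) should approach \(\beta^{\ast}\). A minimal simulation sketch, with assumed settings:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, n_sims = 100, 3, 1.0, 5000      # assumed settings

X = rng.normal(size=(n, p))                  # design held fixed across draws
beta_star = np.array([1.0, -2.0, 0.5])
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)   # (X'X)^{-1} X', computed once

estimates = np.empty((n_sims, p))
for s in range(n_sims):
    eps = rng.normal(scale=sigma, size=n)    # zero-mean errors
    estimates[s] = XtX_inv_Xt @ (X @ beta_star + eps)

print(estimates.mean(axis=0))                # ~ beta_star (unbiased)
```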
Homoscedasticity and BLUE
The OLS estimator is the Best Linear Unbiased Estimator (BLUE) if:
\[\begin{split}
\forall i \in \left[ 1, n \right] &:
\mathrm{E} \left[ \varepsilon_i \right] = 0,
\mathrm{Var} \left[ \varepsilon_i \right] = \sigma^2 \\
\forall i \ne j &:
\mathrm{Cov} \left[ \varepsilon_i, \varepsilon_j \right] = 0
\end{split}\]
Proof [17] [19]:
\[\begin{split}
\epsilon \epsilon^\top &=
\begin{bmatrix}
\varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n
\end{bmatrix}
\cdot
\begin{bmatrix}
\varepsilon_1 & \varepsilon_2 & \cdots & \varepsilon_n
\end{bmatrix} \\
&= \begin{bmatrix}
\varepsilon_1 \varepsilon_1 & \varepsilon_1 \varepsilon_2 & \cdots &
\varepsilon_1 \varepsilon_n \\
\varepsilon_2 \varepsilon_1 & \varepsilon_2 \varepsilon_2 & \cdots &
\varepsilon_2 \varepsilon_n \\
\vdots & \vdots & \ddots & \vdots \\
\varepsilon_n \varepsilon_1 & \varepsilon_n \varepsilon_2 & \cdots &
\varepsilon_n \varepsilon_n
\end{bmatrix} \\
&= \begin{bmatrix}
(\varepsilon_1 - 0) (\varepsilon_1 - 0) &
(\varepsilon_1 - 0) (\varepsilon_2 - 0) & \cdots &
(\varepsilon_1 - 0) (\varepsilon_n - 0) \\
(\varepsilon_2 - 0) (\varepsilon_1 - 0) &
(\varepsilon_2 - 0) (\varepsilon_2 - 0) & \cdots &
(\varepsilon_2 - 0) (\varepsilon_n - 0) \\
\vdots & \vdots & \ddots & \vdots \\
(\varepsilon_n - 0) (\varepsilon_1 - 0) &
(\varepsilon_n - 0) (\varepsilon_2 - 0) & \cdots &
(\varepsilon_n - 0) (\varepsilon_n - 0)
\end{bmatrix}
\end{split}\]
\[\begin{split}
\therefore
\mathrm{E} \left[
\epsilon \epsilon^\top
\right] &=
\begin{bmatrix}
\mathrm{E} \left[
(\varepsilon_1 - 0) (\varepsilon_1 - 0)
\right] &
\mathrm{E} \left[
(\varepsilon_1 - 0) (\varepsilon_2 - 0)
\right] &
\cdots &
\mathrm{E} \left[
(\varepsilon_1 - 0) (\varepsilon_n - 0)
\right] \\
\mathrm{E} \left[
(\varepsilon_2 - 0) (\varepsilon_1 - 0)
\right] &
\mathrm{E} \left[
(\varepsilon_2 - 0) (\varepsilon_2 - 0)
\right] &
\cdots &
\mathrm{E} \left[
(\varepsilon_2 - 0) (\varepsilon_n - 0)
\right] \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{E} \left[
(\varepsilon_n - 0) (\varepsilon_1 - 0)
\right] &
\mathrm{E} \left[
(\varepsilon_n - 0) (\varepsilon_2 - 0)
\right] &
\cdots &
\mathrm{E} \left[
(\varepsilon_n - 0) (\varepsilon_n - 0)
\right]
\end{bmatrix}
\\ &=
\begin{bmatrix}
\mathrm{Var} \left[ \varepsilon_1 \right] &
\mathrm{Cov} \left[ \varepsilon_1, \varepsilon_2 \right] &
\cdots &
\mathrm{Cov} \left[ \varepsilon_1, \varepsilon_n \right] \\
\mathrm{Cov} \left[ \varepsilon_2, \varepsilon_1 \right] &
\mathrm{Var} \left[ \varepsilon_2 \right] &
\cdots &
\mathrm{Cov} \left[ \varepsilon_2, \varepsilon_n \right] \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov} \left[ \varepsilon_n, \varepsilon_1 \right] &
\mathrm{Cov} \left[ \varepsilon_n, \varepsilon_2 \right] &
\cdots &
\mathrm{Var} \left[ \varepsilon_n \right]
\end{bmatrix} \\
&= \sigma^2 I
\end{split}\]
\[\begin{split}
\therefore
\mathrm{Var} \left[ \beta_{\text{ols}} \right]
&= \mathrm{E} \left[
(\beta_{\text{ols}} - \mathrm{E} \left[ \beta_{\text{ols}} \right])
(\beta_{\text{ols}} - \mathrm{E} \left[ \beta_{\text{ols}} \right])^\top
\right] \\
&= \mathrm{E} \left[
((X^\top X)^{-1} X^\top \epsilon)
((X^\top X)^{-1} X^\top \epsilon)^T
\right] \\
&= \mathrm{E} \left[
(X^\top X)^{-1} X^\top \epsilon
\epsilon^\top X (X^\top X)^{-1}
\right] \\
&= (X^\top X)^{-1} X^\top
\mathrm{E} \left[
\epsilon \epsilon^\top
\right]
X (X^\top X)^{-1} \\
&= \sigma^2 (X^\top X)^{-1}
X^\top X (X^\top X)^{-1} \\
&= \sigma^2 (X^\top X)^{-1}
\end{split}\]
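The formula \(\mathrm{Var} \left[ \beta_{\text{ols}} \right] = \sigma^2 (X^\top X)^{-1}\) can likewise be verified by simulation: the empirical covariance of \(\beta_{\text{ols}}\) across repeated noise draws should match the theoretical matrix. A sketch under assumed settings:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, n_sims = 50, 2, 1.5, 20_000   # assumed settings

X = rng.normal(size=(n, p))
beta_star = np.array([0.5, -1.0])
XtX_inv = np.linalg.inv(X.T @ X)

estimates = np.empty((n_sims, p))
for s in range(n_sims):
    y = X @ beta_star + rng.normal(scale=sigma, size=n)
    estimates[s] = XtX_inv @ X.T @ y

print(np.cov(estimates.T))        # empirical covariance of beta_ols
print(sigma**2 * XtX_inv)         # theoretical sigma^2 (X'X)^{-1}
```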
Let \(\tilde{\beta} = \tilde{C} y\) be another
unbiased linear estimator of \(\beta^\ast\) with
\(\tilde{C} = (X^\top X)^{-1} X^\top + D\)
where \(D\) is a \(p \times n\) non-zero matrix.
\[\begin{split}
\therefore
\mathrm{E} \left[ \tilde{\beta} \right]
&= \mathrm{E} \left[ \tilde{C} y \right] \\
&= \mathrm{E} \left[
((X^\top X)^{-1} X^\top + D)
(X \beta^\ast + \epsilon)
\right] \\
&= ((X^\top X)^{-1} X^\top + D) X \beta^\ast +
((X^\top X)^{-1} X^\top + D)
\mathrm{E} \left[ \epsilon \right] \\
&= ((X^\top X)^{-1} X^\top + D) X \beta^\ast \\
&= \beta^\ast + D X \beta^\ast
\end{split}\]
Since \(\tilde{\beta}\) is unbiased for every \(\beta^\ast\), the term \(D X \beta^\ast\) must vanish identically:
\[
\therefore
D X = 0
\]
\[\begin{split}
\therefore
\mathrm{Var} \left[ \tilde{\beta} \right]
&= \mathrm{Var} \left[ \tilde{C} y \right] \\
&= \tilde{C} \mathrm{Var} \left[ y \right] \tilde{C}^\top \\
&= \sigma^2 \tilde{C} \tilde{C}^\top \\
&= \sigma^2 ((X^\top X)^{-1} X^\top + D) ((X^\top X)^{-1} X^\top + D)^\top \\
&= \sigma^2 ((X^\top X)^{-1} X^\top + D) (X (X^\top X)^{-1} + D^\top) \\
&= \sigma^2 ((X^\top X)^{-1} + (X^\top X)^{-1} X^\top D^\top +
D X (X^\top X)^{-1} + D D^\top) \\
&= \sigma^2 (X^\top X)^{-1} + \sigma^2 (X^\top X)^{-1} (D X)^\top +
\sigma^2 D X (X^\top X)^{-1} + \sigma^2 D D^\top
\end{split}\]
\[
\because
D X = 0
\]
\[\begin{split}
\therefore
\mathrm{Var} \left[ \tilde{\beta} \right]
&= \sigma^2 (X^\top X)^{-1} + \sigma^2 D D^\top \\
&= \mathrm{Var} \left[ \beta_{\text{ols}} \right] + \sigma^2 D D^\top
\end{split}\]
Since \(D D^\top\) is a positive semidefinite matrix, the variance of any other
linear unbiased estimator exceeds that of OLS in the positive semidefinite
ordering:
\[
\therefore
\mathrm{Var} \left[ \tilde{\beta} \right] \succeq
\mathrm{Var} \left[ \beta_{\text{ols}} \right]
\]
\[\tag*{$\blacksquare$}\]
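The BLUE argument can also be illustrated numerically. Choosing \(D = M (I - X (X^\top X)^{-1} X^\top)\) for an arbitrary \(p \times n\) matrix \(M\) guarantees \(D X = 0\), so \(\tilde{C} y\) stays unbiased while its variance is inflated by \(\sigma^2 D D^\top\). A sketch with assumed simulation settings:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma, n_sims = 40, 2, 1.0, 20_000       # assumed settings

X = rng.normal(size=(n, p))
beta_star = np.array([1.0, 2.0])
C_ols = np.linalg.solve(X.T @ X, X.T)          # (X'X)^{-1} X'

M = rng.normal(size=(p, n))                    # arbitrary p x n matrix
D = M @ (np.eye(n) - X @ C_ols)                # D X = 0 by construction
C_alt = C_ols + D                              # competing linear estimator

ols_est = np.empty((n_sims, p))
alt_est = np.empty((n_sims, p))
for s in range(n_sims):
    y = X @ beta_star + rng.normal(scale=sigma, size=n)
    ols_est[s], alt_est[s] = C_ols @ y, C_alt @ y

print(alt_est.mean(axis=0))                    # ~ beta_star: still unbiased
# Variance inflation sigma^2 diag(D D'): nonnegative up to Monte Carlo error
print(np.diag(np.cov(alt_est.T)) - np.diag(np.cov(ols_est.T)))
```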
Normally Distributed Error and MLE
The OLS estimator is mathematically equivalent to the Maximum Likelihood
Estimator if the error terms \(\varepsilon_1, \ldots, \varepsilon_n\) are
independently and identically drawn from a normal distribution with zero mean.
Proof:
\[
\because
\varepsilon_i =
y_i - \hat{y}_i =
y_i - x_i^\top \beta
\sim
N(0, \sigma^2)
\]
\[\begin{split}
\therefore
\mathcal{L} (\beta \mid X) &=
\prod_{i=1}^{n} \frac{1}{\sigma \sqrt{2 \pi}}
e^{-\frac{(y_i - x_i^\top \beta)^2}{2 \sigma^2}}
\\ &=
\left( \frac{1}{\sigma \sqrt{2 \pi}} \right)^n
\prod_{i=1}^{n}
e^{-\frac{(y_i - x_i^\top \beta)^2}{2 \sigma^2}}
\\ &=
(2 \pi \sigma^2)^{-\frac{n}{2}}
\prod_{i=1}^{n}
e^{-\frac{(y_i - x_i^\top \beta)^2}{2 \sigma^2}}
\end{split}\]
\[\begin{split}
\therefore
\ln \mathcal{L} (\beta \mid X)
&= -\frac{n}{2} \ln (2 \pi \sigma^2) +
\sum_{i=1}^n -\frac{(y_i - x_i^\top \beta)^2}{2 \sigma^2} \\
&= -\frac{n}{2} \ln (2 \pi \sigma^2) -
\frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - x_i^\top \beta)^2 \\
&= -\frac{n}{2} \ln (2 \pi \sigma^2) -
\frac{1}{2 \sigma^2} (y - X \beta)^\top (y - X \beta)
\end{split}\]
To maximize \(\ln \mathcal{L}\), set its gradient with respect to \(\beta\) to
zero. Since the first term does not depend on \(\beta\), this is equivalent to
minimizing the residual sum of squares:
\[\begin{split}
\nabla_{\beta} \ln \mathcal{L} (\beta \mid X) = 0
\\
\implies
\nabla_{\beta} \left(
(y - X \beta)^\top (y - X \beta)
\right) = 0
\\
\implies
X^\top (y - X \beta) = 0
\end{split}\]
\[
\therefore
\beta_{\text{mle}}
= (X^\top X)^{-1} X^\top y
= \beta_{\text{ols}}
\]
\[\tag*{$\blacksquare$}\]
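Finally, the equivalence can be confirmed numerically by maximizing the Gaussian log-likelihood directly and comparing against the closed form. This sketch applies scipy.optimize.minimize to the negative log-likelihood; all data-generating settings are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, p, sigma = 150, 3, 1.0                  # assumed settings

X = rng.normal(size=(n, p))
beta_star = np.array([2.0, -1.0, 0.5])
y = X @ beta_star + rng.normal(scale=sigma, size=n)

def neg_log_likelihood(beta):
    # Up to additive constants, -ln L = ||y - X beta||^2 / (2 sigma^2)
    r = y - X @ beta
    return 0.5 * (r @ r) / sigma**2

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(p)).x
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_mle, beta_ols, atol=1e-4))   # expected: True
```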