Linear Regression: Formula Derivation
Functional Representation
Define a Hilbert space:
\[
\mathcal{H} = \mathbb{R}^p ,
\]
with inner product:
\[
\langle u, v \rangle = u^\top v = \sum_{j=1}^{p} u_j v_j .
\]
We are given data \(\{ (x_i, y_i) \}_{i=1}^N\), with \(x_i \in \mathcal{H}\), \(y_i \in \mathbb{R}\).
It turns out that the predictor is a continuous linear functional:
\[
f : \mathcal{H} \to \mathbb{R} .
\]
By the Riesz representation theorem, in a Hilbert space \(\mathcal{H}\), for every continuous linear functional \(f \in \mathcal{H}^*\), there exists a unique \(\beta \in \mathcal{H}\) such that:
\[
f(x) = \langle x, \beta \rangle \quad \text{for all } x \in \mathcal{H} .
\]
Thus, fitting the model means finding \(\beta^{\ast}\) such that:
\[
\beta^{\ast} = \operatorname*{arg\,min}_{\beta \in \mathcal{H}} \; \sum_{i=1}^{N} \ell \left( y_i, \langle x_i, \beta \rangle \right) ,
\]
where \(\ell\) is the loss function.
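As a small worked instance of the Riesz representation (the numbers are illustrative, not from the text): in \(\mathcal{H} = \mathbb{R}^2\), the linear functional \(f(x) = 3 x_1 - 2 x_2\) has representer \(\beta = (3, -2)^\top\), since
\[
f(x) = 3 x_1 - 2 x_2
= \begin{pmatrix} x_1 & x_2 \end{pmatrix}
\begin{pmatrix} 3 \\ -2 \end{pmatrix}
= \langle x, \beta \rangle .
\]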
Least Squares Solution
Here we use the squared loss:
\[
\ell \left( y_i, \langle x_i, \beta \rangle \right) = \left( y_i - \langle x_i, \beta \rangle \right)^2 .
\]
Rewrite in matrix form:
\[
\beta^{\ast} = \operatorname*{arg\,min}_{\beta} \; \| y - X \beta \|^2 ,
\]
where \(X\) is the \(N \times p\) design matrix with \(X_{ij} = x_{ij}\), and \(y\) is the \(N \times 1\) vector of responses.
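To make the matrix form concrete, here is a minimal NumPy sketch on synthetic data (all names and numbers are illustrative, not from the text) that evaluates the objective \(\| y - X \beta \|^2\):

```python
import numpy as np

# Synthetic data; names and sizes are illustrative.
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))                    # N x p design matrix
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)   # responses with small noise

def squared_loss(beta):
    """Matrix-form objective ||y - X beta||^2."""
    r = y - X @ beta
    return r @ r

print(squared_loss(beta_true))    # near the noise floor
print(squared_loss(np.zeros(p)))  # much larger
```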
The most straightforward way to find \(\beta^{\ast}\) is to take the gradient w.r.t. \(\beta\) first:
\[
\nabla_\beta \, \| y - X \beta \|^2
= \nabla_\beta \left( y^\top y - 2 \beta^\top X^\top y + \beta^\top X^\top X \beta \right)
= -2 X^\top y + 2 X^\top X \beta .
\]
Setting the gradient to zero gives the normal equations:
\[
X^\top X \beta = X^\top y .
\]
We can solve for \(\beta^{\ast}\) when \(X^\top X\) is invertible, i.e. has full rank; this is why the no-collinearity assumption is needed for linear regression:
\[
\beta^{\ast} = \left( X^\top X \right)^{-1} X^\top y .
\]
This is the formula for the least squares solution of linear regression.
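As a sanity check, the closed-form solution can be computed with NumPy (a sketch on the same kind of illustrative synthetic data as above; np.linalg.solve is used rather than forming the inverse explicitly, which is the numerically preferred route):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

# Solve the normal equations (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's dedicated least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
print(beta_hat)   # close to [1.0, -2.0, 0.5]
```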
Note
\(X^+ = (X^\top X)^{-1} X^\top\) is known as the Moore–Penrose inverse [21] of \(X\).
\(X^\top X\) is a Gram matrix [20] over the reals, which is symmetric and positive semi-definite; it is invertible if and only if its determinant is non-zero, which is equivalent to \(X\) having linearly independent columns. In particular, if \(X\) is centered, then \(X^\top X\) is the scatter matrix of \(X\), proportional to the covariance matrix \(\operatorname{cov}(X)\).
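A quick numerical check of the note (a sketch, assuming \(X\) has full column rank): np.linalg.pinv computes the Moore–Penrose inverse via the SVD, and it agrees with the explicit formula here:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # full column rank with probability 1

# For full-column-rank X, the Moore-Penrose inverse reduces to
# (X^T X)^{-1} X^T; np.linalg.pinv computes it via the SVD.
X_plus = np.linalg.inv(X.T @ X) @ X.T
assert np.allclose(X_plus, np.linalg.pinv(X))
```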
A more intuitive way is to note that the response vector \(y\) is a given point in \(\mathbb{R}^N\). We can define the column space:
\[
\mathcal{S} = \operatorname{span} \left\{ x^{(1)}, \ldots, x^{(p)} \right\} \subseteq \mathbb{R}^N ,
\]
where \(x^{(i)}\) is the \(i\)-th column of \(X\).
Thus \(X \beta \in \mathcal{S}\).
To minimize \(\| y - X \beta \|\) is to make \(X \beta\) the orthogonal projection of \(y\) onto \(\mathcal{S}\), so that \(y - X \beta\) is orthogonal to \(\mathcal{S}\), i.e.:
\[
\langle y - X \beta , \, x^{(i)} \rangle = 0 , \quad i = 1, \ldots, p .
\]
Rewrite in matrix form:
\[
X^\top \left( y - X \beta \right) = 0 ,
\]
which is exactly the normal equations \(X^\top X \beta = X^\top y\) derived above.
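The projection picture is easy to verify numerically; the sketch below (same illustrative synthetic setup as before) checks that the residual is orthogonal to every column of \(X\) and that the resulting projection matrix reproduces \(X \beta^{\ast}\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ beta_hat

# The residual is orthogonal to every column of X ...
assert np.allclose(X.T @ residual, 0.0)

# ... so X beta_hat is the orthogonal projection of y onto S = col(X).
P = X @ np.linalg.solve(X.T @ X, X.T)   # projection ("hat") matrix
assert np.allclose(P @ y, X @ beta_hat)
assert np.allclose(P @ P, P)            # projections are idempotent
```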