Mean Squared Error Decomposition¶
TL;DR¶
\[\begin{split}\mathrm{MSR} \left[ f \right] &=
\mathrm{MSE} \left[ f \right] +
\mathrm{E} \left[ \epsilon^2 \right]
\\\\
\mathrm{MSE} \left[ f \right] &=
\mathrm{Var} \left[ f \right] +
\mathrm{Bias}^2 \left[ f \right]\end{split}\]
Derivation¶
Problem Definition¶
Suppose the predictor \(x\) and the response \(y\) follow a true relationship \(f^*\), which can be described as:
\[y = f^* (x) + \epsilon\]
where the zero-mean random variable \(\epsilon\) has constant variance:
\[\begin{split}\mathrm{E} \left[ \epsilon \right] &= 0
\\
\mathrm{Var} \left[ \epsilon \right] &= \sigma_{\epsilon}^2\end{split}\]
We use a hypothesis model \(f\) to obtain the prediction \(\hat{y}\):
\[\hat{y} = f(x)\]
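As a concrete illustration, here is a minimal NumPy sketch of this setup; the choices \(f^*(x) = \sin(x)\), \(\sigma_{\epsilon} = 0.5\), and a least-squares line as the hypothesis \(f\) are assumptions made only for the example:

    import numpy as np

    rng = np.random.default_rng(0)

    def f_star(x):
        # True relationship f* (an illustrative assumption).
        return np.sin(x)

    sigma_eps = 0.5                                 # noise standard deviation
    x = rng.uniform(0.0, 2 * np.pi, size=200)
    eps = rng.normal(0.0, sigma_eps, size=x.shape)  # E[eps] = 0, Var[eps] = sigma_eps**2
    y = f_star(x) + eps                             # y = f*(x) + eps

    # Hypothesis model f: a least-squares line fit to (x, y).
    a, b = np.polyfit(x, y, deg=1)
    y_hat = a * x + b                               # \hat{y} = f(x)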
With this setup, the variance and bias of \(f\) can be written as follows:
\[\begin{split}\mathrm{Var} \left[ f \right] &=
\mathrm{E} \left[ (f (x) - \mathrm{E} \left[ f(x) \right])^2 \right]
\\ &=
\mathrm{E} \left[
f^2 (x) - 2 f(x) \mathrm{E} \left[ f(x) \right] +
\mathrm{E}^2 \left[ f(x) \right]
\right]
\\ &=
\mathrm{E} \left[ f^2 (x) \right] -
2 \mathrm{E} \left[ f(x) \mathrm{E} \left[ f(x) \right] \right] +
\mathrm{E}^2 \left[ f (x) \right]
\\ &=
\mathrm{E} \left[ f^2 (x) \right] -
\mathrm{E}^2 \left[ f (x) \right]
\\\\
\mathrm{Bias} \left[ f \right] &=
\mathrm{E} \left[ f(x) - f^* (x) \right]
\\ &=
\mathrm{E} \left[ f(x) \right] - f^* (x)\end{split}\]
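In these expectations \(f\) is the random quantity (e.g. through the training sample it was fit on), while \(f^*(x)\) is treated as deterministic. A hedged Monte Carlo sketch that estimates \(\mathrm{Var} \left[ f \right]\) and \(\mathrm{Bias} \left[ f \right]\) at one query point \(x_0\) by refitting the same line on fresh training sets (all specific values are assumptions of the example):

    import numpy as np

    rng = np.random.default_rng(1)
    f_star = np.sin
    sigma_eps, x0, n_train, n_trials = 0.5, 1.0, 50, 2000

    preds = np.empty(n_trials)
    for t in range(n_trials):
        # Draw a fresh training set and refit the hypothesis model.
        x = rng.uniform(0.0, 2 * np.pi, size=n_train)
        y = f_star(x) + rng.normal(0.0, sigma_eps, size=n_train)
        a, b = np.polyfit(x, y, deg=1)
        preds[t] = a * x0 + b              # f(x0) under this training set

    var_f = preds.var()                    # E[(f(x0) - E[f(x0)])^2]
    bias_f = preds.mean() - f_star(x0)     # E[f(x0)] - f*(x0)
    print(f"Var[f] ~ {var_f:.4f}, Bias[f] ~ {bias_f:.4f}")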
Residual vs. Error vs. Noise¶
Random Noise:
\[\epsilon\]
Model \(f\) prediction error:
\[e = f^* (x) - f (x)\]
Residual:
\[\begin{split}\hat{\epsilon} &=
y - \hat{y}
\\ &=
f^* (x) - f (x) + \epsilon
\\ &=
e + \epsilon\end{split}\]
Mean Squared Residual (MSR):
\[\begin{split}\mathrm{MSR} \left[ f \right] &=
\mathrm{E} \left[ (y - \hat{y})^2 \right]
\\ &=
\mathrm{E} \left[ (f^* (x) + \epsilon - f(x))^2 \right]
\\ &=
\mathrm{E} \left[ (e + \epsilon)^2 \right]\end{split}\]
Mean Squared Error (MSE):
\[\begin{split}\mathrm{MSE} \left[ f \right] &=
\mathrm{E} \left[ (f^* (x) - f(x))^2 \right]
\\ &=
\mathrm{E} \left[ e^2 \right]\end{split}\]
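Continuing the simulated setup, the sketch below computes all three quantities side by side; note that \(e\), and hence the MSE, is only observable here because the simulation knows the true \(f^*\), which is never the case on real data:

    import numpy as np

    rng = np.random.default_rng(2)
    f_star, sigma_eps = np.sin, 0.5

    x = rng.uniform(0.0, 2 * np.pi, size=100_000)
    eps = rng.normal(0.0, sigma_eps, size=x.shape)   # noise
    y = f_star(x) + eps

    a, b = np.polyfit(x, y, deg=1)                   # hypothesis model f
    y_hat = a * x + b

    e = f_star(x) - y_hat          # prediction error (requires knowing f*)
    resid = y - y_hat              # residual
    assert np.allclose(resid, e + eps)               # residual = error + noise

    msr = np.mean(resid ** 2)      # mean squared residual
    mse = np.mean(e ** 2)          # mean squared error
    print(f"MSR ~ {msr:.4f}, MSE ~ {mse:.4f}")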
MSR Decomposition¶
The MSR equals the MSE plus an irreducible noise term whenever the model error and the noise are uncorrelated:
\[\mathrm{Cov} \left[ e, \epsilon \right] = 0
\implies
\mathrm{MSR} \left[ f \right] =
\mathrm{MSE} \left[ f \right] +
\mathrm{E} \left[ \epsilon^2 \right]\]
Proof:
\[\begin{split}\mathrm{MSR} \left[ f \right] &=
\mathrm{E} \left[
(e + \epsilon)^2
\right]
\\ &=
\mathrm{E} \left[
e^2 + 2 \epsilon e + \epsilon^2
\right]
\\ &=
\mathrm{E} \left[ e^2 \right] +
2 \mathrm{E} \left[ e \epsilon \right] +
\mathrm{E} \left[ \epsilon^2 \right]\end{split}\]
\[\begin{split}& \because
\mathrm{Cov} \left[ e, \epsilon \right] = 0
\\
& \therefore
\mathrm{E} \left[ e \epsilon \right] =
\mathrm{E} \left[ e \right] \mathrm{E} \left[ \epsilon \right] = 0\end{split}\]
\[\therefore
\mathrm{MSR} \left[ f \right] =
\mathrm{MSE} \left[ f \right] +
\mathrm{E} \left[ \epsilon^2 \right]\]
\[\tag*{$\blacksquare$}\]
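As a sanity check of this identity, a small simulation with the same assumed setup as the earlier sketches; fitting \(f\) on a separate training sample and evaluating on fresh data keeps \(e\) and \(\epsilon\) uncorrelated as required:

    import numpy as np

    rng = np.random.default_rng(3)
    f_star, sigma_eps = np.sin, 0.5

    # Fit the hypothesis model f on a training sample.
    x_tr = rng.uniform(0.0, 2 * np.pi, size=200)
    y_tr = f_star(x_tr) + rng.normal(0.0, sigma_eps, size=x_tr.shape)
    a, b = np.polyfit(x_tr, y_tr, deg=1)

    # Evaluate on fresh data, whose noise is independent of the fitted f.
    x = rng.uniform(0.0, 2 * np.pi, size=1_000_000)
    eps = rng.normal(0.0, sigma_eps, size=x.shape)
    y = f_star(x) + eps
    y_hat = a * x + b

    msr = np.mean((y - y_hat) ** 2)            # MSR[f]
    mse = np.mean((f_star(x) - y_hat) ** 2)    # MSE[f]
    print(f"MSR ~ {msr:.4f}, MSE + E[eps^2] ~ {mse + np.mean(eps ** 2):.4f}")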
MSE Decomposition¶
The MSE equals variance plus squared bias:
\[\mathrm{MSE} \left[ f \right] =
\mathrm{Var} \left[ f \right] +
\mathrm{Bias}^2 \left[ f \right]\]
Proof:
\[\begin{split}\mathrm{Var} \left[ f \right] +
\mathrm{Bias}^2 \left[ f \right] &=
\mathrm{E} \left[ f^2 (x) \right] -
\mathrm{E}^2 \left[ f (x) \right] +
(\mathrm{E} \left[ f(x) \right] - f^* (x))^2
\\ &=
\mathrm{E} \left[ f^2 (x) \right] -
2 f^* (x) \mathrm{E} \left[ f(x) \right] +
(f^*)^2 (x)
\\ &=
\mathrm{E} \left[
f^2 (x) - 2 f^* (x) f (x) + (f^*)^2 (x)
\right]
\\ &=
\mathrm{E} \left[ (f (x) - f^* (x))^2 \right]
\\ &=
\mathrm{MSE} \left[ f \right]\end{split}\]
\[\tag*{$\blacksquare$}\]
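The matching numerical check for this identity, reusing the assumed Monte Carlo over training sets; with the plug-in (ddof=0) variance estimator the decomposition holds exactly in the sample, up to floating point:

    import numpy as np

    rng = np.random.default_rng(4)
    f_star = np.sin
    sigma_eps, x0, n_train, n_trials = 0.5, 1.0, 50, 20_000

    preds = np.empty(n_trials)
    for t in range(n_trials):
        x = rng.uniform(0.0, 2 * np.pi, size=n_train)
        y = f_star(x) + rng.normal(0.0, sigma_eps, size=n_train)
        a, b = np.polyfit(x, y, deg=1)
        preds[t] = a * x0 + b

    mse = np.mean((preds - f_star(x0)) ** 2)    # MSE[f] at x0
    var_f = preds.var()                         # Var[f]
    bias2 = (preds.mean() - f_star(x0)) ** 2    # Bias^2[f]
    print(f"MSE ~ {mse:.4f}, Var + Bias^2 ~ {var_f + bias2:.4f}")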
Comments¶
If the noise term \(\epsilon\) is uncorrelated with the model prediction error term \(e\), then (noting \(\mathrm{E} \left[ \epsilon^2 \right] = \mathrm{Var} \left[ \epsilon \right] = \sigma_{\epsilon}^2\) since \(\mathrm{E} \left[ \epsilon \right] = 0\)):
\[\mathrm{MSR} \left[ f \right] =
\mathrm{MSE} \left[ f \right] +
\sigma_{\epsilon}^2 =
\mathrm{Var} \left[ f \right] +
\mathrm{Bias}^2 \left[ f \right] +
\sigma_{\epsilon}^2\]
Specifically, for an unbiased model \(f\) (e.g. the OLS estimator), the bias term vanishes:
\[\mathrm{MSR} \left[ f \right] =
\mathrm{Var} \left[ f \right] +
\sigma_{\epsilon}^2\]
which means the mean squared residual can be used to evaluate model performance, i.e. to estimate the variance that cannot be explained by the model.