LDA: Dimensionality Reduction
Linear Discriminant Analysis (LDA) is widely used as a dimensionality
reduction technique. Its goal is to find a linear projection
\(\mathbf{w}\) that maximizes the ratio of between-class variance to
within-class variance.
Within / Between Class Variance
Suppose we have a \(p\)-dimensional dataset
\(\mathcal{D} = \{\mathbf{x}_i\}_{i=1}^N\) with \(N\) samples and
\(K\) classes.
The mean vector of each class \(\mathbf{\mu}_k\) is defined as:
\[\mathbf{\mu}_k =
\frac{1}{N_k} \sum_{\mathbf{x} \in \mathcal{D}_k} \mathbf{x}\]
where \(N_k\) is the number of samples in class \(k\) and
\(\mathcal{D}_k\) is the set of samples in class \(k\).
The within-class scatter matrix for class \(k\) is defined as:
\[\mathbf{S}_{W_k} =
\sum_{\mathbf{x} \in \mathcal{D}_k}
(\mathbf{x} - \mathbf{\mu}_k) (\mathbf{x} - \mathbf{\mu}_k)^T\]
The within-class scatter matrix for the entire dataset is defined as:
\[\mathbf{S}_W = \sum_{k=1}^K \mathbf{S}_{W_k}\]
The between-class scatter matrix is defined as:
\[\mathbf{S}_B =
\sum_{k=1}^K
N_k (\mathbf{\mu}_k - \mathbf{\mu}) (\mathbf{\mu}_k - \mathbf{\mu})^T\]
where \(\mathbf{\mu}\) is the mean vector of the entire dataset:
\[\mathbf{\mu} = \frac{1}{N} \sum_{\mathbf{x} \in \mathcal{D}} \mathbf{x}\]
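As a concrete illustration, these quantities can be computed with a few lines of NumPy. This is a minimal sketch on a synthetic dataset; the data, the class layout, and the variable names (X, y, S_W, S_B) are illustrative assumptions, not part of the derivation.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: N = 150 samples, p = 4 features, K = 3 classes.
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 4)) for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 50)

N, p = X.shape
mu = X.mean(axis=0)                      # overall mean vector
S_W = np.zeros((p, p))                   # within-class scatter
S_B = np.zeros((p, p))                   # between-class scatter
for k in np.unique(y):
    X_k = X[y == k]
    N_k = X_k.shape[0]
    mu_k = X_k.mean(axis=0)              # class mean vector
    D_k = X_k - mu_k
    S_W += D_k.T @ D_k                   # sum of (x - mu_k)(x - mu_k)^T
    d_k = (mu_k - mu).reshape(-1, 1)
    S_B += N_k * (d_k @ d_k.T)           # N_k (mu_k - mu)(mu_k - mu)^T

print(S_W.shape, S_B.shape)              # (4, 4) (4, 4)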
Optimal Projection
Suppose the optimal projection is \(\mathbf{w}\).
The original dataset can be projected onto \(\mathbf{w}\) to obtain a
one-dimensional dataset \(\mathcal{D}' = \{y_i\}_{i=1}^N\) where
\(y_i = \mathbf{w}^T \mathbf{x}_i\).
The projected mean of each class \(\mu_k'\) is defined as:
\[\begin{split}\mu_k' &= \frac{1}{N_k} \sum_{y \in \mathcal{D}'_k} y
\\ &=
\frac{1}{N_k} \sum_{\mathbf{x} \in \mathcal{D}_k} \mathbf{w}^T \mathbf{x}
\\ &=
\mathbf{w}^T \mathbf{\mu}_k\end{split}\]
The mean of the entire projected dataset is defined as:
\[\begin{split}\mu' &= \frac{1}{N} \sum_{y \in \mathcal{D}'} y
\\ &=
\frac{1}{N} \sum_{\mathbf{x} \in \mathcal{D}} \mathbf{w}^T \mathbf{x}
\\ &=
\mathbf{w}^T \mathbf{\mu}\end{split}\]
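A quick numerical sanity check of these two identities, assuming synthetic data and an arbitrary direction w (all names here are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 3))             # 90 samples, 3 features
y = np.repeat([0, 1, 2], 30)
w = rng.normal(size=3)                   # an arbitrary projection direction

proj = X @ w                             # y_i = w^T x_i for every sample
for k in np.unique(y):
    mu_k = X[y == k].mean(axis=0)
    # projected class mean equals w^T mu_k
    assert np.isclose(proj[y == k].mean(), w @ mu_k)
# mean of the projected dataset equals w^T mu
assert np.isclose(proj.mean(), w @ X.mean(axis=0))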
The within-class scatter of the projected dataset (a scalar, since the projection is one-dimensional) is defined as:
\[\begin{split}\mathbf{S}'_{W} &=
\sum_{k=1}^K \sum_{y \in \mathcal{D}'_k} (y - \mu_k')^2
\\ &=
\sum_{k=1}^K \sum_{y \in \mathcal{D}'_k}
(y - \mu_k') (y - \mu_k')^T
\\ &=
\sum_{k=1}^K \sum_{\mathbf{x} \in \mathcal{D}_k}
(\mathbf{w}^T \mathbf{x} - \mathbf{w}^T \mathbf{\mu}_k)
(\mathbf{w}^T \mathbf{x} - \mathbf{w}^T \mathbf{\mu}_k)^T
\\ &=
\sum_{k=1}^K \sum_{\mathbf{x} \in \mathcal{D}_k}
\mathbf{w}^T (\mathbf{x} - \mathbf{\mu}_k)
(\mathbf{x} - \mathbf{\mu}_k)^T \mathbf{w}
\\ &=
\mathbf{w}^T \mathbf{S}_W \mathbf{w}\end{split}\]
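The identity \(\mathbf{S}'_{W} = \mathbf{w}^T \mathbf{S}_W \mathbf{w}\) can be verified numerically with the same kind of sketch (again on assumed synthetic data):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(90, 3))
y = np.repeat([0, 1, 2], 30)
w = rng.normal(size=3)

proj = X @ w
S_W = np.zeros((3, 3))
s_w_proj = 0.0
for k in np.unique(y):
    X_k, p_k = X[y == k], proj[y == k]
    D_k = X_k - X_k.mean(axis=0)
    S_W += D_k.T @ D_k                           # within-class scatter in the original space
    s_w_proj += np.sum((p_k - p_k.mean()) ** 2)  # scatter of the projected samples
assert np.isclose(s_w_proj, w @ S_W @ w)         # S'_W = w^T S_W w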
The between-class scatter of the projected dataset is defined as:
\[\begin{split}\mathbf{S}'_{B} &=
\sum_{k=1}^K N_k (\mu_k' - \mu')^2
\\ &=
\sum_{k=1}^K N_k (\mu_k' - \mu') (\mu_k' - \mu')^T
\\ &=
\sum_{k=1}^K N_k
(\mathbf{w}^T \mathbf{\mu}_k - \mathbf{w}^T \mathbf{\mu})
(\mathbf{w}^T \mathbf{\mu}_k - \mathbf{w}^T \mathbf{\mu})^T
\\ &=
\sum_{k=1}^K N_k
\mathbf{w}^T (\mathbf{\mu}_k - \mathbf{\mu})
(\mathbf{\mu}_k - \mathbf{\mu})^T \mathbf{w}
\\ &=
\mathbf{w}^T \mathbf{S}_B \mathbf{w}\end{split}\]
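And likewise for \(\mathbf{S}'_{B} = \mathbf{w}^T \mathbf{S}_B \mathbf{w}\), a minimal check under the same assumptions:

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(90, 3))
y = np.repeat([0, 1, 2], 30)
w = rng.normal(size=3)

proj = X @ w
mu, mu_proj = X.mean(axis=0), proj.mean()
S_B = np.zeros((3, 3))
s_b_proj = 0.0
for k in np.unique(y):
    N_k = np.sum(y == k)
    d_k = (X[y == k].mean(axis=0) - mu).reshape(-1, 1)
    S_B += N_k * (d_k @ d_k.T)                           # between-class scatter
    s_b_proj += N_k * (proj[y == k].mean() - mu_proj) ** 2
assert np.isclose(s_b_proj, w @ S_B @ w)                 # S'_B = w^T S_B w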
The ratio of between-class variance to within-class variance is defined as:
\[\begin{split}J(\mathbf{w}) &=
\frac{\mathbf{S}'_{B}}{\mathbf{S}'_{W}}
\\ &=
\frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}} {\mathbf{w}^T \mathbf{S}_W \mathbf{w}}\end{split}\]
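In code, this criterion is a one-line ratio of quadratic forms. The sketch below, with small made-up matrices standing in for \(\mathbf{S}_B\) and \(\mathbf{S}_W\), also previews the scale invariance discussed in the next step:

import numpy as np

def fisher_criterion(w, S_B, S_W):
    # Ratio of projected between-class to within-class scatter, J(w).
    w = np.asarray(w, dtype=float)
    return (w @ S_B @ w) / (w @ S_W @ w)

S_W = np.array([[2.0, 0.3], [0.3, 1.5]])   # illustrative stand-ins
S_B = np.array([[1.0, 0.4], [0.4, 0.8]])
w = np.array([1.0, -2.0])
# J(w) = J(c w) for any nonzero constant c.
assert np.isclose(fisher_criterion(w, S_B, S_W), fisher_criterion(3.0 * w, S_B, S_W))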
Lagrangian Function
Our goal is to find the optimal projection \(\mathbf{w}\) that maximizes
\(J(\mathbf{w})\).
However, \(J(\mathbf{w})\) is invariant to rescaling of \(\mathbf{w}\),
i.e., \(J(\mathbf{w}) = J(c \mathbf{w})\) for any constant
\(c \neq 0\), so the maximizer is only determined up to scale.
By introducing the constraint \(\mathbf{w}^T \mathbf{S}_W \mathbf{w} = 1\),
we fix this scale and turn the problem into maximizing
\(\mathbf{w}^T \mathbf{S}_B \mathbf{w}\) subject to the constraint.
We define the Lagrangian function as:
\[L(\mathbf{w}, \lambda) =
\mathbf{w}^T \mathbf{S}_B \mathbf{w} -
\lambda (\mathbf{w}^T \mathbf{S}_W \mathbf{w} - 1)\]
The stationary point of \(L(\mathbf{w}, \lambda)\) can be found by
solving the following equation:
\[\begin{split}\frac{\partial L}{\partial \mathbf{w}} &=
2 \mathbf{S}_B \mathbf{w} - 2 \lambda \mathbf{S}_W \mathbf{w}
\\ &= 0\end{split}\]
which is equivalent to:
\[\mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w}\]
When \(\mathbf{S}_W\) is nonsingular, the equation can be further
simplified to the ordinary eigenvalue problem:
\[\mathbf{S}_W^{-1} \mathbf{S}_B \mathbf{w} = \lambda \mathbf{w}\]
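Numerically, one way to solve this is scipy.linalg.eigh, which handles the symmetric-definite generalized problem \(\mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w}\) directly; the \(\mathbf{S}_W^{-1} \mathbf{S}_B\) form gives the same eigenvalues. A minimal sketch with small illustrative stand-in matrices:

import numpy as np
from scipy.linalg import eigh

S_W = np.array([[2.0, 0.3], [0.3, 1.5]])   # illustrative stand-ins for the scatter matrices
S_B = np.array([[1.0, 0.4], [0.4, 0.8]])

# Generalized symmetric-definite problem S_B w = lambda S_W w.
# scipy returns eigenvalues in ascending order, so take the last one.
eigvals, eigvecs = eigh(S_B, S_W)
w_star, lam_max = eigvecs[:, -1], eigvals[-1]

# Equivalent (when S_W is nonsingular): ordinary eigenproblem of S_W^{-1} S_B.
vals2, vecs2 = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
assert np.isclose(lam_max, vals2.real.max())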
By plugging \(\mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w}\) back into \(J(\mathbf{w})\), we get:
\[\begin{split}J(\mathbf{w}) &=
\frac{\mathbf{w}^T \mathbf{S}_B \mathbf{w}}
{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}
\\ &=
\frac{\mathbf{w}^T \lambda \mathbf{S}_W \mathbf{w}}
{\mathbf{w}^T \mathbf{S}_W \mathbf{w}}
\\ &=
\lambda\end{split}\]
Therefore, maximizing \(J(\mathbf{w})\) is equivalent to
finding the eigenvector of the matrix \(\mathbf{S}_W^{-1} \mathbf{S}_B\)
corresponding to its largest eigenvalue (and, when projecting onto more than
one dimension, the eigenvectors corresponding to the next-largest eigenvalues).
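Putting everything together, here is a minimal end-to-end sketch on synthetic two-class data (the dataset and all names are illustrative assumptions): it builds \(\mathbf{S}_W\) and \(\mathbf{S}_B\), solves the generalized eigenvalue problem, and checks that the maximized criterion \(J(\mathbf{w}^*)\) equals the largest eigenvalue.

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
N_k, p = 100, 5
# Two classes that differ only along the first coordinate.
X0 = rng.normal(size=(N_k, p))
X1 = rng.normal(size=(N_k, p))
X1[:, 0] += 4.0
X = np.vstack([X0, X1])
y = np.repeat([0, 1], N_k)

mu = X.mean(axis=0)
S_W = np.zeros((p, p))
S_B = np.zeros((p, p))
for k in (0, 1):
    X_k = X[y == k]
    mu_k = X_k.mean(axis=0)
    D_k = X_k - mu_k
    S_W += D_k.T @ D_k
    d_k = (mu_k - mu).reshape(-1, 1)
    S_B += X_k.shape[0] * (d_k @ d_k.T)

eigvals, eigvecs = eigh(S_B, S_W)            # S_B w = lambda S_W w
w_star, lam_max = eigvecs[:, -1], eigvals[-1]

# The maximized criterion equals the largest eigenvalue: J(w*) = lambda_max.
J = (w_star @ S_B @ w_star) / (w_star @ S_W @ w_star)
assert np.isclose(J, lam_max)
print(np.round(w_star / np.abs(w_star).max(), 2))

Because the two classes differ only along the first coordinate, the recovered direction is dominated by that coordinate, which is exactly the behavior the criterion is designed to produce.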