SVM: Kernel Method

We seek \(f \in \mathcal{H}\), where \(\mathcal{H}\) is a Reproducing Kernel Hilbert Space (RKHS) of real-valued functions on \(\mathcal{X} \subseteq \mathbb{R}^d\).

\[ f : \mathcal{X} \to \mathbb{R}, \quad \mathcal{X} \subseteq \mathbb{R}^d \]

By the definition of an RKHS, for each \(x \in \mathcal{X}\) the pointwise evaluation functional

\[ L_x : f \mapsto f(x) \]

is a continuous (equivalently, bounded) linear functional on \(\mathcal{H}\).

By the Riesz Representation Theorem:

\[ \forall x \in \mathcal{X}: \quad \exists k_x \in \mathcal{H} \quad \text{s.t.} \quad f(x) = L_x (f) = \langle f, k_x \rangle_{\mathcal{H}} \quad \forall f \in \mathcal{H} \]

To confirm the reproducing property, take \(f = k_x\) and evaluate at a point \(y \in \mathcal{X}\):

\[ k_x (y) = L_y (k_x) = \langle k_x, k_y \rangle_{\mathcal{H}} \]

We define the reproducing kernel as a function \(k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) such that:

\[ k(x, y) = \langle k_x, k_y \rangle_{\mathcal{H}} \]
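
As a concrete illustration, here is a minimal sketch of a kernel and its Gram matrix, assuming the Gaussian (RBF) kernel \(k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))\); the helper names `rbf_kernel` and `gram_matrix` are illustrative and not part of the derivation above.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    """Gram matrix K with entries K[i, j] = k(x_i, x_j) for the rows x_i of X."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

# k_x corresponds to the function y -> k(x, y); the Gram matrix collects
# the pairwise inner products <k_{x_i}, k_{x_j}>_H = k(x_i, x_j).
X = np.random.default_rng(0).normal(size=(5, 2))   # 5 toy points in R^2
K = gram_matrix(X, rbf_kernel)
print(K.shape)               # (5, 5)
print(np.allclose(K, K.T))   # True: a kernel Gram matrix is symmetric
```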

Note

In the RKHS setting, the notion of a “dimensionality increase” is abstract: functions live in a (possibly) infinite-dimensional space. The “effective dimension”, however, is governed by the kernel \(k\), which determines the structure of \(\mathcal{H}\). Once a kernel is chosen, the hypothesis space is fixed, and no additional dimensions are introduced during learning. The representer theorem below also shows that the solution is a finite linear combination of kernel functions evaluated at the training points. Some authors write \(\mathcal{H}_k\) for the RKHS associated with the kernel \(k\); for simplicity we write \(\mathcal{H}\).

Empirical Risk

To find the optimal function \(f^*\), we minimize a regularized empirical risk:

\[ f^* = \arg\min_{f \in \mathcal{H}} \; R(\|f\|_{\mathcal{H}}^2) + E(f(x_1), \ldots, f(x_n)) \]

Where:

  • \((x_i, y_i) \in \mathcal{X} \times \mathbb{R}\) are the training data,

  • \(R\) is a non-decreasing regularizer (e.g., \(R(u) = \lambda u\)),

  • \(E\) is an empirical loss function (e.g., squared loss, hinge loss; a short sketch of both follows this list).
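
Below is a minimal sketch of the two example losses, assuming labels \(y_i \in \{-1, +1\}\) for the hinge loss; the function names are illustrative only.

```python
import numpy as np

def squared_loss(f_vals, y):
    """Squared loss: E = (1/2) * sum_i (f(x_i) - y_i)^2."""
    f_vals, y = np.asarray(f_vals), np.asarray(y)
    return 0.5 * np.sum((f_vals - y) ** 2)

def hinge_loss(f_vals, y):
    """Hinge loss (labels y_i in {-1, +1}): E = sum_i max(0, 1 - y_i * f(x_i))."""
    f_vals, y = np.asarray(f_vals), np.asarray(y)
    return np.sum(np.maximum(0.0, 1.0 - y * f_vals))

print(squared_loss([0.9, -1.2], [1.0, -1.0]))   # 0.025
print(hinge_loss([0.9, -1.2], [1.0, -1.0]))     # 0.1 (up to floating point)
```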

By the representer theorem, any minimizer of the above problem admits the form:

\[ f^* = \sum_{i=1}^n \alpha_i k_{x_i} \]

where \(\alpha_i \in \mathbb{R}\).

We will use it to compute the empirical risk. First, we compute the norm:

\[ \|f\|_{\mathcal{H}}^2 = \left\langle \sum_{i=1}^n \alpha_i k_{x_i}, \sum_{j=1}^n \alpha_j k_{x_j} \right\rangle_{\mathcal{H}} = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \langle k_{x_i}, k_{x_j} \rangle_{\mathcal{H}} = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) \]
\[ \therefore \|f\|_{\mathcal{H}}^2 = \alpha^\top K \alpha \]

where

  • \(K \in \mathbb{R}^{n \times n}\) is the Gram matrix with entries \(K_{ij} = k(x_i, x_j)\).

  • \(\alpha \in \mathbb{R}^n\) is the vector of coefficients \(\alpha_i\).

The loss term \(E\) depends on the predictions at the training points, which we collect into a vector:

\[\begin{split} \mathbf{f} = \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^n \alpha_i k(x_1, x_i) \\ \sum_{i=1}^n \alpha_i k(x_2, x_i) \\ \vdots \\ \sum_{i=1}^n \alpha_i k(x_n, x_i) \end{bmatrix} \end{split}\]

Expanding the sums row by row:

\[\begin{split} \mathbf{f} = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & \cdots & k(x_2, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{bmatrix} \end{split}\]

which is precisely the matrix-vector product:

\[ \mathbf{f} = K \alpha \]
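
A minimal numerical sketch of these two quantities, assuming an RBF kernel with \(\sigma = 1\) and random toy data (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))        # 5 toy training points in R^2
alpha = rng.normal(size=5)         # coefficients from the representer form

# Gram matrix for the RBF kernel k(x, y) = exp(-||x - y||^2 / 2).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

f_train = K @ alpha                # predictions at the training points: f = K alpha
norm_sq = alpha @ K @ alpha        # RKHS norm squared: ||f||_H^2 = alpha^T K alpha
print(f_train.shape, norm_sq)      # (5,) and a non-negative scalar
```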

Derivation of Solution

Now the regularized empirical risk can be expressed as (for concreteness, we use the squared loss and \(R(u) = \tfrac{\lambda}{2} u\)):

\[ \min_{\alpha \in \mathbb{R}^n} \; \frac{\lambda}{2} \alpha^\top K \alpha + \frac{1}{2} \|K \alpha - y\|^2 \]

To minimize, take the gradient w.r.t. \(\alpha\) (using the symmetry of \(K\)):

\[ \nabla_\alpha \left( \frac{\lambda}{2} \alpha^\top K \alpha + \frac{1}{2} \|K \alpha - y\|^2 \right) = \lambda K \alpha + K (K \alpha - y) = K (\lambda \alpha + K \alpha - y) \]

Set gradient to zero:

\[ K (\lambda \alpha + K \alpha - y) = 0 \]

Assuming \(K\) is positive definite (or at least invertible), this implies:

\[ (\lambda I + K) \alpha = y \]

Solution (for \(\lambda > 0\), the matrix \(\lambda I + K\) is invertible because \(K\) is positive semi-definite):

\[ \boxed{ \alpha = (\lambda I + K)^{-1} y } \]
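
Putting the pieces together, here is a minimal kernel ridge regression sketch based on this closed-form solution; the RBF kernel, the helper names (`rbf_gram`, `fit`, `predict`), and the toy data are illustrative assumptions rather than part of the derivation.

```python
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix with entries k(a_i, b_j) for the RBF kernel."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def fit(X, y, lam=0.1, sigma=1.0):
    """Solve (lam * I + K) alpha = y for the coefficient vector alpha."""
    K = rbf_gram(X, X, sigma)
    return np.linalg.solve(lam * np.eye(len(y)) + K, y)

def predict(X_train, alpha, X_new, sigma=1.0):
    """f(x) = sum_i alpha_i k(x, x_i), evaluated at each row of X_new."""
    return rbf_gram(X_new, X_train, sigma) @ alpha

# Toy usage: fit a noisy sine and evaluate on the training points.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
alpha = fit(X, y, lam=0.1, sigma=1.0)
y_hat = predict(X, alpha, X, sigma=1.0)
print(np.mean((y_hat - y) ** 2))   # small training error
```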

Appendix: Proof of Riesz Representation Theorem

Let \(\mathcal{H}\) be a Hilbert space of real-valued functions on \(\mathcal{X} \subseteq \mathbb{R}^d\), equipped with an inner product \(\langle \cdot, \cdot \rangle_{\mathcal{H}}\). For any bounded linear functional \(L: \mathcal{H} \to \mathbb{R}\), we want to show that there exists a unique \(g \in \mathcal{H}\) such that

\[ \forall f \in \mathcal{H}: \quad L(f) = \langle f, g \rangle_{\mathcal{H}} \]

Proof:

In the trivial case where \(L\) is the zero functional, i.e. \(L(f) = 0\) for all \(f \in \mathcal{H}\), we can take \(g = 0\), since \(\langle f, 0 \rangle = 0 = L(f)\) for every \(f\).

Now consider the case where \(L\) is not identically zero, i.e. \(L(f) \ne 0\) for some \(f \in \mathcal{H}\).

First, we define the kernel (null space) of \(L\) and its orthogonal complement:

\[ \ker L = \{ f \in \mathcal{H} : L(f) = 0 \}. \]
\[ (\ker L)^\perp = \{ g \in \mathcal{H} : \langle g, f \rangle = 0, \quad \forall f \in \ker L \} \]

Since \(L\) is a bounded (hence continuous) linear functional, \(\ker L\) is a closed subspace of \(\mathcal{H}\). Thus we can apply the orthogonal decomposition theorem:

\[ \mathcal{H} = \ker L \oplus (\ker L)^\perp \]

and any \(f \in \mathcal{H}\) can be uniquely decomposed as:

\[ f = f_0 + f_\perp \]

where \(f_0 \in \ker L\) and \(f_\perp \in (\ker L)^\perp\). Since \(L\) is not identically zero, \(\ker L \ne \mathcal{H}\), so this decomposition forces \((\ker L)^\perp \ne \{0\}\); fix a nonzero \(g_0 \in (\ker L)^\perp\). Note that \(L(g_0) \ne 0\), for otherwise \(g_0 \in \ker L \cap (\ker L)^\perp = \{0\}\).

Let:

\[ g = \frac{L(g_0)}{\|g_0\|^2} g_0 \]
\[ \therefore \langle g_0, g \rangle = \left\langle g_0, \frac{L(g_0)}{\|g_0\|^2} g_0 \right\rangle = \frac{\|g_0\|^2}{\|g_0\|^2} L(g_0) = L(g_0). \]

Now take any \(f = f_0 + f_\perp\) as above. The element \(L(f_\perp)\, g_0 - L(g_0)\, f_\perp\) lies in \(\ker L\) (by linearity of \(L\)) and in \((\ker L)^\perp\) (as a linear combination of \(g_0\) and \(f_\perp\)), hence it equals \(0\), so \(f_\perp = \frac{L(f_\perp)}{L(g_0)} g_0\). Therefore:

\[ \because L(f) = L(f_0 + f_\perp) = 0 + L(f_\perp) = L(f_\perp) \]
\[ \because \langle f, g \rangle = \langle f_0 + f_\perp, g \rangle = \langle f_\perp, g \rangle = \frac{L(f_\perp)}{L(g_0)} \langle g_0, g \rangle = L(f_\perp) \]
\[ \therefore L(f) = \langle f, g \rangle_{\mathcal{H}} \quad \forall f \in \mathcal{H} \]

To prove the uniqueness of \(g\), suppose there are \(g_1, g_2 \in \mathcal{H}\) such that

\[ L(f) = \langle f, g_1 \rangle = \langle f, g_2 \rangle \quad \forall f \in \mathcal{H}. \]
\[ \therefore \langle f, g_1 - g_2 \rangle = 0 \quad \forall f \in \mathcal{H}. \]

Take \(f = g_1 - g_2\):

\[ \|g_1 - g_2\|^2 = 0 \Rightarrow g_1 = g_2. \]
\[ \tag*{$\blacksquare$} \]
