SVM: Kernel Method¶
We seek \(f \in \mathcal{H}\), where \(\mathcal{H}\) is a Reproducing Kernel Hilbert Space (RKHS) of real-valued functions on \(\mathcal{X} \subseteq \mathbb{R}^d\).
By the definition of an RKHS, for each \(x \in \mathcal{X}\) the evaluation functional \(\delta_x : f \mapsto f(x)\)
is a continuous (equivalently, bounded) linear functional on \(\mathcal{H}\).
By the Riesz Representation Theorem, for each \(x \in \mathcal{X}\) there exists a unique representer \(k_x \in \mathcal{H}\) such that
\[
f(x) = \langle f, k_x \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}.
\]
To confirm the reproducing property, take \(f = k_x\) and evaluate it at \(x'\):
\[
k_x(x') = \langle k_x, k_{x'} \rangle_{\mathcal{H}}.
\]
We define the reproducing kernel as the function \(k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) such that
\[
k(x, x') = \langle k_x, k_{x'} \rangle_{\mathcal{H}} = k_x(x').
\]
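For concreteness, a standard example (added here purely as an illustration; it is not a choice made anywhere in this section) is the Gaussian (RBF) kernel on \(\mathbb{R}^d\),
\[
k(x, x') = \exp\!\left( -\frac{\|x - x'\|_2^2}{2\sigma^2} \right),
\]
whose associated RKHS is infinite-dimensional; the bandwidth \(\sigma > 0\) is a hyperparameter.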
Note
In the RKHS setting, the notion of “dimensionality increase” is abstract: functions may live in an infinite-dimensional space. The “effective dimension”, however, is governed by the kernel \(k\), which determines the structure of \(\mathcal{H}\). Once a kernel is chosen, the hypothesis space is fixed, and no additional dimensions are introduced during learning. Later, the representer theorem shows that the solution is a finite linear combination of kernel functions evaluated at the training points. Some authors write \(\mathcal{H}_k\) for the RKHS associated with the kernel \(k\); for simplicity we write \(\mathcal{H}\).
Empirical Risk¶
To find the optimal function \(f^*\), we minimize the regularized empirical risk:
\[
f^* = \operatorname*{arg\,min}_{f \in \mathcal{H}} \; E\big( (f(x_1), y_1), \ldots, (f(x_n), y_n) \big) + R\big( \|f\|_{\mathcal{H}}^2 \big).
\]
Where:
\((x_i, y_i) \in \mathcal{X} \times \mathbb{R}\), \(i = 1, \ldots, n\), are the training data,
\(R\) is a non-decreasing regularizer (e.g., \(R(u) = \lambda u\)),
\(E\) is an empirical loss function (e.g., squared loss, hinge loss).
By the representer theorem, any minimizer of the above problem admits the form
\[
f^*(\cdot) = \sum_{i=1}^{n} \alpha_i\, k(\cdot, x_i),
\]
where \(\alpha_i \in \mathbb{R}\).
We will use this form to compute the empirical risk. First, we compute the squared RKHS norm:
\[
\|f\|_{\mathcal{H}}^2
= \Big\langle \sum_{i=1}^{n} \alpha_i k(\cdot, x_i),\; \sum_{j=1}^{n} \alpha_j k(\cdot, x_j) \Big\rangle_{\mathcal{H}}
= \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, k(x_i, x_j)
= \alpha^\top K \alpha,
\]
where
\(K \in \mathbb{R}^{n \times n}\) is the Gram matrix with entries \(K_{ij} = k(x_i, x_j)\).
\(\alpha \in \mathbb{R}^n\) is the vector of coefficients \(\alpha_i\).
The empirical loss depends on the predictions on the training data, which form a vector
\[
\hat{y} = \big( f(x_1), \ldots, f(x_n) \big)^\top,
\qquad
f(x_j) = \sum_{i=1}^{n} \alpha_i\, k(x_j, x_i) = \sum_{i=1}^{n} K_{ji}\, \alpha_i .
\]
The right-hand side is precisely the \(j\)-th entry of a matrix-vector product, so
\[
\hat{y} = K \alpha .
\]
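A minimal numerical sketch of these two quantities, assuming a Gaussian kernel and synthetic data (the kernel choice, bandwidth, and variable names are illustrative, not fixed by the text):

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X1 and X2."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))        # n = 5 training points in R^2
alpha = rng.normal(size=5)         # coefficients alpha_i

K = rbf_kernel(X, X)               # Gram matrix, K_ij = k(x_i, x_j)
rkhs_norm_sq = alpha @ K @ alpha   # ||f||_H^2 = alpha^T K alpha
y_hat = K @ alpha                  # predictions f(x_j) on the training points
```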
Derivation of Solution¶
Now the empirical risk can be expressed in terms of \(\alpha\) (for simplicity, we use the squared loss and the regularizer \(R(u) = \lambda u\)):
\[
J(\alpha) = \|y - K\alpha\|_2^2 + \lambda\, \alpha^\top K \alpha ,
\]
where \(y = (y_1, \ldots, y_n)^\top\).
To minimize, take the gradient with respect to \(\alpha\):
\[
\nabla_\alpha J(\alpha) = -2 K (y - K\alpha) + 2 \lambda K \alpha .
\]
Set the gradient to zero:
\[
-2 K (y - K\alpha) + 2 \lambda K \alpha = 0
\quad \Longleftrightarrow \quad
K \big( (K + \lambda I)\alpha - y \big) = 0 .
\]
Assuming \(K\) is positive definite (or at least invertible), this implies
\[
(K + \lambda I)\, \alpha = y .
\]
Solution:
\[
\alpha^* = (K + \lambda I)^{-1} y .
\]
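A compact sketch of this closed-form solution, continuing the illustrative assumptions above (Gaussian kernel, synthetic data, arbitrary regularization strength `lam`):

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """Same illustrative Gaussian kernel as before."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))             # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)  # noisy targets

lam = 0.1
K = rbf_kernel(X, X)                                   # Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # alpha* = (K + lam I)^{-1} y

X_test = np.linspace(-3, 3, 200)[:, None]
y_pred = rbf_kernel(X_test, X) @ alpha           # f*(x) = sum_i alpha_i k(x, x_i)
```

Solving the linear system with `np.linalg.solve` rather than forming \((K + \lambda I)^{-1}\) explicitly is the usual, numerically safer choice.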
Appendix: Proof of Riesz Representation Theorem¶
Let \(\mathcal{H}\) be a Hilbert space of real-valued functions on \(\mathcal{X} \subseteq \mathbb{R}^d\), equipped with an inner product \(\langle \cdot, \cdot \rangle_{\mathcal{H}}\). For any bounded linear functional \(L: \mathcal{H} \to \mathbb{R}\), we want to show that there exists a unique \(g \in \mathcal{H}\) such that
\[
L(f) = \langle f, g \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H} .
\]
Proof:
For the trivial case where \(L\) is identically zero (\(L(f) = 0\) for all \(f \in \mathcal{H}\)), we can take \(g = 0\), which gives \(\langle f, 0 \rangle_{\mathcal{H}} = 0 = L(f)\) for every \(f\).
Let us consider the case when \(L(f) \ne 0\) for some \(f \in \mathcal{H}\).
First, we define the kernel of \(L\) and its orthogonal complement:
\[
\ker(L) = \{ f \in \mathcal{H} : L(f) = 0 \},
\qquad
(\ker L)^\perp = \{ h \in \mathcal{H} : \langle h, f \rangle_{\mathcal{H}} = 0 \ \text{for all } f \in \ker(L) \} .
\]
Since \(L\) is a bounded linear functional, \(\ker(L)\) is a closed subspace of \(\mathcal{H}\). Thus we can apply the orthogonal decomposition theorem:
\[
\mathcal{H} = \ker(L) \oplus (\ker L)^\perp ,
\]
and any \(f \in \mathcal{H}\) can be uniquely decomposed as
\[
f = f_0 + g_0 ,
\]
where \(f_0 \in \ker(L)\) and \(g_0 \in (\ker L)^\perp\). Note that \(g_0 \ne 0\), since otherwise \(L(f) = L(f_0) = 0\), contradicting \(L(f) \ne 0\).
Let
\[
g = \frac{L(g_0)}{\|g_0\|_{\mathcal{H}}^2}\, g_0 .
\]
Then, using \(\langle f_0, g_0 \rangle_{\mathcal{H}} = 0\) and \(L(f_0) = 0\),
\[
\langle f, g \rangle_{\mathcal{H}}
= \frac{L(g_0)}{\|g_0\|_{\mathcal{H}}^2} \big( \langle f_0, g_0 \rangle_{\mathcal{H}} + \langle g_0, g_0 \rangle_{\mathcal{H}} \big)
= L(g_0)
= L(f_0) + L(g_0)
= L(f) .
\]
To prove the uniqueness of \(g\), suppose there are \(g_1, g_2 \in \mathcal{H}\) such that
\[
L(f) = \langle f, g_1 \rangle_{\mathcal{H}} = \langle f, g_2 \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H} .
\]
Take \(f = g_1 - g_2\):
\[
0 = L(f) - L(f)
= \langle g_1 - g_2, g_1 \rangle_{\mathcal{H}} - \langle g_1 - g_2, g_2 \rangle_{\mathcal{H}}
= \|g_1 - g_2\|_{\mathcal{H}}^2 ,
\]
so \(g_1 = g_2\), which completes the proof.
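As a quick sanity check (an added example, not part of the proof above), consider the finite-dimensional case \(\mathcal{H} = \mathbb{R}^n\) with the standard inner product: every linear functional has the form \(L(f) = \sum_{i} a_i f_i\) for some fixed \(a \in \mathbb{R}^n\), and the representer is simply \(g = a\), since \(L(f) = \langle f, a \rangle\).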
Back to Statistical Learning.