Gradient in Backpropagation

TL;DR

TBD[1].

\[ \tag*{$\blacksquare$} \]

Fully Connected Layer

Suppose:

\[\mathbf{z} = \mathbf{W} \mathbf{x} + \mathbf{b}\]

where

\[\begin{split} \mathbf{x} = (x_1, x_2, \dots, x_n)^T \\ \mathbf{z} = (z_1, z_2, \dots, z_m)^T \\ \mathbf{b} = (b_1, b_2, \dots, b_m)^T \end{split}\]
\[\begin{split} \mathbf{W} = \begin{pmatrix} w_{11} & w_{12} & \dots & w_{1n} \\ w_{21} & w_{22} & \dots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m1} & w_{m2} & \dots & w_{mn} \end{pmatrix} \end{split}\]
\[ \therefore z_i = \sum_{j=1}^n w_{ij} x_j + b_i \]
\[ \therefore \frac{\partial z_i}{\partial w_{ij}} = x_j \]
\[ \frac{\partial z_i}{\partial b_i} = 1 \]
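
For a quick numerical sanity check of these partials, here is a minimal NumPy sketch (not part of the derivation); the layer sizes m, n, the random values, and the finite-difference step eps are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 4                      # illustrative layer sizes
W = rng.normal(size=(m, n))
b = rng.normal(size=m)
x = rng.normal(size=n)

def forward(W, b, x):
    # z_i = sum_j w_ij x_j + b_i
    return W @ x + b

z = forward(W, b, x)

# Finite-difference check of dz_i / dw_ij = x_j
eps = 1e-6
i, j = 1, 2                      # any valid index pair
W_pert = W.copy()
W_pert[i, j] += eps
dz_dw = (forward(W_pert, b, x)[i] - z[i]) / eps
print(np.isclose(dz_dw, x[j]))   # True: the derivative equals x_j

# Finite-difference check of dz_i / db_i = 1
b_pert = b.copy()
b_pert[i] += eps
dz_db = (forward(W, b_pert, x)[i] - z[i]) / eps
print(np.isclose(dz_db, 1.0))    # True
```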

Gradient

We define a kind of “gradient” of the loss L with respect to each element of the weight matrix:

\[\begin{split} \frac{\partial L}{\partial \mathbf{W}} = \begin{pmatrix} \frac{\partial L}{\partial w_{11}} & \dots & \frac{\partial L}{\partial w_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial w_{m1}} & \dots & \frac{\partial L}{\partial w_{mn}} \end{pmatrix} \end{split}\]

Note that this is NOT the Jacobian matrix (the Jacobian of z with respect to W would be a 3D tensor). What we really want is a delta of the same shape as W, with which we can update the weight matrix. Writing the layer output as y = f(z), where f is the activation applied elementwise and L is the loss computed from y, the chain rule gives, entry by entry:

\[\begin{split} \frac{\partial L}{\partial \mathbf{W}} &= \begin{pmatrix} \frac{\partial L}{\partial y_1} \frac{\partial y_1}{\partial z_1} \frac{\partial z_1}{\partial w_{11}} & \dots & \frac{\partial L}{\partial y_1} \frac{\partial y_1}{\partial z_1} \frac{\partial z_1}{\partial w_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial y_m} \frac{\partial y_m}{\partial z_m} \frac{\partial z_m}{\partial w_{m1}} & \dots & \frac{\partial L}{\partial y_m} \frac{\partial y_m}{\partial z_m} \frac{\partial z_m}{\partial w_{mn}} \end{pmatrix} \\ &= \begin{pmatrix} \frac{\partial L}{\partial y_1} f' (z_1) x_1 & \dots & \frac{\partial L}{\partial y_1} f' (z_1) x_n \\ \vdots & \ddots & \vdots \\ \frac{\partial L}{\partial y_m} f' (z_m) x_1 & \dots & \frac{\partial L}{\partial y_m} f' (z_m) x_n \end{pmatrix} \\ &= (\frac{\partial L}{\partial y_1}, \dots, \frac{\partial L}{\partial y_m})^T \odot (f' (z_1), \dots, f' (z_m))^T \cdot \mathbf{x}^T \\ &= \nabla_{\mathbf{y}} L \odot f'(\mathbf{z}) \cdot \mathbf{x}^T \end{split}\]
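
To check the closed-form result against a numerical gradient, the sketch below assumes, purely for illustration, that f is tanh and that L is a squared error against a fixed target t (neither is specified above); any smooth choice would do:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
W = rng.normal(size=(m, n))
b = rng.normal(size=m)
x = rng.normal(size=n)
t = rng.normal(size=m)                    # fixed target, only to make L concrete

def f(z):
    return np.tanh(z)                     # example activation

def f_prime(z):
    return 1.0 - np.tanh(z) ** 2          # derivative of tanh

def loss(W):
    z = W @ x + b
    y = f(z)                              # y = f(z), elementwise activation
    return 0.5 * np.sum((y - t) ** 2)     # example loss

# Closed form from the derivation: dL/dW = (dL/dy * f'(z)) x^T (outer product)
z = W @ x + b
dL_dy = f(z) - t                          # gradient of the example loss w.r.t. y
grad_closed = (dL_dy * f_prime(z))[:, None] * x[None, :]

# Finite-difference gradient, element by element
eps = 1e-6
grad_num = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        W_pert = W.copy()
        W_pert[i, j] += eps
        grad_num[i, j] = (loss(W_pert) - loss(W)) / eps

print(grad_closed.shape == W.shape)                     # same shape as W
print(np.allclose(grad_closed, grad_num, atol=1e-4))    # True
```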

Back to Statistical Learning.