The math behind MLPs and autodiff
Computing Gradients
Matrix calculus, a part of multivariable calculus, is really not that different from single-variable calculus. In multivariable calculus, functions map $\mathbb{R}^n$ to $\mathbb{R}^m$, and you use gradients, Jacobians, etc. to express their derivatives. Matrix calculus simply organizes these derivatives when the objects involved are matrices and vectors. But the principles are the same: you're still applying the chain rule, the product rule, and so on, just in higher dimensions and with some admittedly absurd notation.
- From calculus 101: given a function $f: (x_1, x_2, \cdots , x_n) \in \mathbb{R}^n \mapsto f(\mathbf{x}) \in \mathbb{R}$, the gradient $\partial f / \partial \mathbf{x} \in \mathbb{R}^n$ equals \(\left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \cdots, \frac{\partial f}{\partial x_n} \right]\), the vector of partial derivatives.
- A linear transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$ is represented by an $m \times n$ matrix (in the column-vector convention).
- The Jacobian matrix of a mapping $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, $\mathbf{x} \mapsto \mathbf{f}(\mathbf{x})$, is the $m \times n$ matrix of partial derivatives $\partial \mathbf{f} / \partial \mathbf{x} \in \mathbb{R}^{m \times n}$, with entries \(\left( \partial \mathbf{f} / \partial \mathbf{x} \right)_{ij} = \partial f_i / \partial x_j\); it is the linear map that best approximates $\mathbf{f}$ near $\mathbf{x}$.
We often write the Jacobian $\partial \mathbf{f} / \partial \mathbf{x}$ as $\partial \mathbf{f}(\mathbf{x})$, which maps the tangent space of the domain of $\mathbf{f}$ around the point $\mathbf{x}$ to the tangent space of the codomain of $\mathbf{f}$ around the point $\mathbf{f}(\mathbf{x})$.
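For a concrete example (a toy function chosen only for illustration), take $\mathbf{f}: \mathbb{R}^2 \to \mathbb{R}^2$ with $\mathbf{f}(x, y) = (xy, \sin x)$. Its Jacobian at $(x, y)$ is

\[\frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} \\ \frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y} \end{bmatrix} = \begin{bmatrix} y & x \\ \cos x & 0 \end{bmatrix},\]

a $2 \times 2$ matrix that sends a tangent vector at $(x, y)$ to the corresponding tangent vector at $\mathbf{f}(x, y)$.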
Some building blocks
- Consider an elementwise nonlinearity $f$ applied coordinate-wise, giving a map $\mathbb{R}^n \to \mathbb{R}^n$: \(\mathbf{h} = f (\mathbf{z}) \quad \text{i.e.} \quad h_i = f(z_i)\). Its Jacobian is the diagonal matrix $\text{diag}(f^\prime(\mathbf{z}))$.
- Consider a linear layer: \(\mathbf{z} = \mathbf{W} \mathbf{x} + \mathbf{b} \quad \text{i.e.} \quad z_i = W_{ij} x_j + b_i\) (summation over $j$ implied). Its Jacobians are $\partial \mathbf{z} / \partial \mathbf{x} = \mathbf{W}$ and $\partial \mathbf{z} / \partial \mathbf{b} = \mathbf{I}$.
- For two vectors $\mathbf{u}$ and $\mathbf{v}$, the dot product is defined as $\mathbf{u}^T \mathbf{v} \in \mathbb{R}$; the sketch after this list checks the Jacobians of these blocks numerically.
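Here is that sketch, assuming JAX is available; the dimensions and the choice $f = \tanh$ are arbitrary.

```python
import jax
import jax.numpy as jnp

n, m = 3, 2
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
x = jax.random.normal(k1, (n,))
W = jax.random.normal(k2, (m, n))
b = jax.random.normal(k3, (m,))
u = jax.random.normal(k4, (m,))

# Elementwise layer h = f(z): Jacobian is diag(f'(z)).
f = jnp.tanh
z = W @ x + b
print(jnp.allclose(jax.jacobian(f)(z), jnp.diag(1 - jnp.tanh(z) ** 2)))  # True

# Linear layer z = W x + b: Jacobian w.r.t. x is W, w.r.t. b is the identity.
linear = lambda x_, b_: W @ x_ + b_
print(jnp.allclose(jax.jacobian(linear, argnums=0)(x, b), W))           # True
print(jnp.allclose(jax.jacobian(linear, argnums=1)(x, b), jnp.eye(m)))  # True

# Dot product s = u^T h: gradient w.r.t. h is u itself.
print(jnp.allclose(jax.grad(lambda h: u @ h)(f(z)), u))  # True
```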
The reason behind the choice of $\mathbf{v}^T$ over $\mathbf{v}$ may seem somewhat unclear at first. One way to see it is to view the Jacobian as a mapping that takes an input $x \in \mathbb{R}^n$ together with a tangent vector $v \in \mathbb{R}^n$ and produces the output tangent vector in $\mathbb{R}^m$: $(x, v) \mapsto \partial f(x)\, v$, i.e. $\partial f: \mathbb{R}^n \to \mathbb{R}^n \to \mathbb{R}^m$ in curried form.
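This mapping is exactly what a Jacobian-vector product computes. A minimal sketch, assuming JAX and a toy function chosen only for illustration:

```python
import jax
import jax.numpy as jnp

# A toy function from R^3 to R^2.
def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])   # primal input
v = jnp.array([0.1, 0.0, -0.2])  # tangent vector at x

# jax.jvp evaluates f(x) and the Jacobian-vector product df(x) v in one pass.
y, out_tangent = jax.jvp(f, (x,), (v,))

# The same quantity via the full Jacobian, just to confirm the correspondence.
J = jax.jacobian(f)(x)                   # shape (2, 3)
print(jnp.allclose(out_tangent, J @ v))  # True
```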
A simple example: MLP and backpropagation
A multi-layer perceptron (MLP) with one hidden layer can be represented mathematically as follows: \(s = \mathbf{u}^T \mathbf{h} \longleftarrow \mathbf{h} = f(\mathbf{z}) \longleftarrow \mathbf{z} = \mathbf{W} \mathbf{x} + \mathbf{b} \longleftarrow \mathbf{x} \text{ (input)}\)
Since $s = \mathbf{u}^T \mathbf{h}$, we have $\partial s / \partial \mathbf{h} = \mathbf{u}^T$, so

\[\begin{aligned} \frac{\partial s}{\partial \mathbf{z}} &= \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \\ &= \mathbf{u}^T \text{diag}(f^\prime(\mathbf{z})) \\ &= \mathbf{u}^T \circ f^\prime(\mathbf{z})^T \\ &:= \delta \text{ (local error signal)} \end{aligned}\]

where $\circ$ denotes the elementwise (Hadamard) product. Since $\partial \mathbf{z} / \partial \mathbf{b} = \mathbf{I}$,

\[\begin{aligned} \therefore \frac{\partial s}{\partial \mathbf{b}} &= \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{b}} = \delta \, \mathbf{I} = \delta. \end{aligned}\]

In addition,
\[\begin{aligned} \left( \frac{\partial \mathbf{z}} {\partial \mathbf{W}} \right)_{ijk} &= \frac{\partial z_i} {\partial W_{jk}} = \begin{cases} x_k & \text{ if } i = j \\ 0 & \text{ otherwise } \\ \end{cases} \end{aligned}\]

so the only nonzero entries involve $\mathbf{x}$, and collecting them gives

\[\begin{aligned} \therefore \frac{\partial s}{\partial \mathbf{W}} &= \frac{\partial s}{\partial \mathbf{h}} \frac{\partial \mathbf{h}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{W}} \\ &= \delta^T \mathbf{x}^T \quad \text{(an outer product)}. \end{aligned}\]

Computers turn this chain-rule bookkeeping into an algorithm: automatic differentiation. In JAX, for example:
```python
import jax.numpy as jnp
from jax import grad

# grad takes a function and returns a function that computes its gradient.
grad_tanh = grad(jnp.tanh)
print(grad_tanh(2.0))  # 0.070650816, i.e. 1 - tanh(2.0)**2
```
A general `grad` is expected to take a function and return a (differentiated, i.e. gradient) function. This means that if you have a Python function `f` that evaluates the mathematical function $f$, then `grad(f)` is a Python function that evaluates the mathematical function $\nabla f$, and `grad(f)(x)` represents the value $\nabla f(x)$.
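To close the loop, here is a sketch (the dimensions and the choice $f = \tanh$ are again arbitrary) that evaluates the hand-derived expressions $\delta$, $\partial s / \partial \mathbf{b} = \delta$, and $\partial s / \partial \mathbf{W} = \delta^T \mathbf{x}^T$ for the MLP above and compares them against what `jax.grad` returns:

```python
import jax
import jax.numpy as jnp

def score(W, b, u, x):
    # s = u^T f(W x + b) with f = tanh
    z = W @ x + b
    h = jnp.tanh(z)
    return u @ h

n, m = 4, 3
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
x = jax.random.normal(k1, (n,))
W = jax.random.normal(k2, (m, n))
b = jax.random.normal(k3, (m,))
u = jax.random.normal(k4, (m,))

# Hand-derived backpropagation.
z = W @ x + b
delta = u * (1 - jnp.tanh(z) ** 2)   # u^T diag(f'(z)), stored as a vector
grad_b_manual = delta                # ds/db = delta
grad_W_manual = jnp.outer(delta, x)  # ds/dW = delta^T x^T (outer product)

# Autodiff with jax.grad.
grad_W_auto, grad_b_auto = jax.grad(score, argnums=(0, 1))(W, b, u, x)

print(jnp.allclose(grad_b_manual, grad_b_auto))  # True
print(jnp.allclose(grad_W_manual, grad_W_auto))  # True
```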