Autograd

This note discusses how autograd works.

Recall that in the fundamentals note, we mentioned that a machine learning model is

$$y = f_{model}(x_1, x_2, x_3, …, x_m, w_1, w_2, w_3, …, w_k)$$

and we would like to minimize the loss

$$L = loss(y, \tilde{y})$$

In order to do that, we need to calculate $\frac{\partial L}{\partial w_i}$ for each $i$ from $1$ to $k$. We were able to do that manually; in this note, we discuss how to calculate these derivatives automatically.

Problem Definition

Although we could build autograd to tackle the above problem directly, we would like to make it more general.

More precisely, the goal of autograd is: given a function implemented in Python (or any other programming language), return its gradient function in the same programming language.

To state this mathematically, suppose the given function is

$$f(x_1, x_2, x_3, …, x_m) = (y_1, y_2, y_3, …, y_n)$$

our goal is to calculate $\frac{\partial y_j}{\partial x_i}$ for each $(i, j)$ pair, in other words, the Jacobian matrix

$$ J_f = \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & … & \frac{\partial y_1}{\partial x_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_n}{\partial x_1} & … & \frac{\partial y_n}{\partial x_m} \\ \end{pmatrix} $$

For example, if $f$ is

$$f(x_1, x_2) = x_1 x_2 + \ln(x_1)$$

autograd should give us

$$J_f(x_1, x_2) = (x_2 + \frac{1}{x_1}, x_1)$$

Note that in this general form, $x_i$ can either be an input (e.g., $x_i$ in $f_{model}$) or a weight (e.g., $w_i$ in $f_{model}$).

Solutions

There are many potential solutions for autograd.

Numerical Differentiation

The simplest way would be to use numerical differentiation, i.e.,

$$\frac{\partial y}{\partial x_j} \approx \frac{f(x_1, …, x_j + \delta, …, x_m) - f(x_1, …, x_j, …, x_m)}{\delta}$$
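As a quick illustration, a minimal (and naive) numerical Jacobian for a scalar-valued function might look like this; the helper name and the default step size are just illustrative choices:

```python
import math

def numerical_jacobian(f, xs, delta=1e-6):
    """Approximate df/dx_j for a scalar-valued f at point xs via forward differences."""
    y0 = f(*xs)
    grads = []
    for j in range(len(xs)):
        shifted = list(xs)
        shifted[j] += delta
        grads.append((f(*shifted) - y0) / delta)
    return grads

def f(x1, x2):
    return x1 * x2 + math.log(x1)

print(numerical_jacobian(f, [math.e, 1.0]))  # roughly [1 + 1/e, e]
```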

However, this solution has a few technical challenges that make it unsuitable for training ML models:

  • It’s hard to decide the right value for $\delta$.
    • If it’s too large, the result might not be close to the gradient at all.
    • If it’s too small, the result is not going to be accurate, since computers only have limited floating-point precision.
  • It’s computationally heavy: we need at least one extra evaluation of $f$ per input dimension.

Symbolic Differentiation

Another solution would be to represent the model symbolically, and then derive the derivatives with the chain rule.
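For illustration, this is roughly what symbolic differentiation looks like using an off-the-shelf library such as SymPy (shown only to make the idea concrete; it is not the approach we build on):

```python
import sympy

# Represent f symbolically, then let the library apply the differentiation rules.
x1, x2 = sympy.symbols('x1 x2')
f = x1 * x2 + sympy.log(x1)

print(sympy.diff(f, x1))  # x2 + 1/x1
print(sympy.diff(f, x2))  # x1
```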

This is not quite feasible either, since

  • The symbolic form of the function quickly gets complicated.
  • It’s not straightforward to convert Python source code into a symbolic representation. This is even more challenging if the source code contains control flow logic.

Graph Computation

Graph computation means we represent the target function as a computation graph, and then derive the derivatives from the graph with the chain rule.

For example, if $f(x_1, x_2) = x_1 x_2 + \ln(x_1)$, we can break it into intermediate nodes $v_0 = x_1$, $v_1 = x_2$, $v_2 = v_0 v_1$, $v_3 = \ln(v_0)$, and $v_4 = v_2 + v_3 = y$; the computation graph connects these nodes from inputs to output.

In such a graph, circles represent data (or tensors) and boxes represent ops.

To calculate $J_f$, we just need to get $\frac{\partial y}{\partial x_1}$ and $\frac{\partial y}{\partial x_2}$. There are two ways to get them: forward mode and reverse mode.

Forward Mode

In the forward mode, to calculate $\frac{\partial y}{\partial x_1}$, we first calculate $\frac{\partial v_0}{\partial x_1}$, then $\frac{\partial v_1}{\partial x_1}$, then $\frac{\partial v_2}{\partial x_1}$, …, and finally $\frac{\partial v_4}{\partial x_1}$, which is $\frac{\partial y}{\partial x_1}$.

This is the calculation process for evaluation point $x_1 = e, x_2 = 1$.

| $i$ | $v_i$ | $\frac{\partial v_i}{\partial x_1}$ |
| --- | --- | --- |
| 0 | $e$ | $1$ |
| 1 | $1$ | $0$ |
| 2 | $e$ | $1 (= \frac{\partial v_0}{\partial x_1} v_1 + v_0 \frac{\partial v_1}{\partial x_1})$ |
| 3 | $1$ | $\frac{1}{e} (= \frac{1}{v_0} \frac{\partial v_0}{\partial x_1})$ |
| 4 | $e + 1$ | $1 + \frac{1}{e} (= \frac{\partial v_2}{\partial x_1} + \frac{\partial v_3}{\partial x_1})$ |
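The same column can be computed in plain Python by carrying each $v_i$ together with its tangent $\frac{\partial v_i}{\partial x_1}$; a minimal sketch mirroring the table above:

```python
import math

# Forward sweep at (x1, x2) = (e, 1); each dv_i is dv_i/dx_1.
v0, dv0 = math.e, 1.0                      # v0 = x1, seed dv0/dx1 = 1
v1, dv1 = 1.0, 0.0                         # v1 = x2
v2, dv2 = v0 * v1, dv0 * v1 + v0 * dv1     # product rule
v3, dv3 = math.log(v0), dv0 / v0           # derivative of ln(v0)
v4, dv4 = v2 + v3, dv2 + dv3               # sum rule
print(v4, dv4)                             # e + 1 and 1 + 1/e
```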

Similarly, we can also calculate $\frac{\partial y}{\partial x_2}$.

| $i$ | $v_i$ | $\frac{\partial v_i}{\partial x_2}$ |
| --- | --- | --- |
| 0 | $e$ | $0$ |
| 1 | $1$ | $1$ |
| 2 | $e$ | $e (= \frac{\partial v_0}{\partial x_2} v_1 + v_0 \frac{\partial v_1}{\partial x_2})$ |
| 3 | $1$ | $0 (= \frac{1}{v_0} \cdot 0)$ |
| 4 | $e + 1$ | $e (= \frac{\partial v_2}{\partial x_2} + \frac{\partial v_3}{\partial x_2})$ |

Based on the above calculations, $J_f$ at the evaluation point $(e, 1)$ is:

$$J_f(e, 1) = (1 + \frac{1}{e}, e)$$

which matches our earlier analysis, i.e., $J_f(x_1, x_2) = (x_2 + \frac{1}{x_1}, x_1)$.

Note that in the forward mode, we walk through the graph twice: once for $x_1$ and once for $x_2$. In general, each pass through the graph yields one column of $J_f$ (one pass per input), so for $f: R^m \rightarrow R^n$, forward mode is preferred when $n > m$.
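One standard way to implement forward mode is with dual numbers, where each value carries its tangent along; this is not discussed in the note above, but it is equivalent to the column-by-column sweeps we just did. A minimal sketch, with class and function names of my own choosing:

```python
import math

class Dual:
    """A value together with its derivative w.r.t. one chosen input."""
    def __init__(self, value, grad):
        self.value = value
        self.grad = grad

    def __mul__(self, other):
        # Product rule: (a * b)' = a' * b + a * b'
        return Dual(self.value * other.value,
                    self.grad * other.value + self.value * other.grad)

    def __add__(self, other):
        # Sum rule: (a + b)' = a' + b'
        return Dual(self.value + other.value, self.grad + other.grad)

def dual_log(x):
    # ln(a)' = a' / a
    return Dual(math.log(x.value), x.grad / x.value)

def f(x1, x2):
    return x1 * x2 + dual_log(x1)

# One forward pass per input: seed grad=1 for the input we differentiate against.
dy_dx1 = f(Dual(math.e, 1.0), Dual(1.0, 0.0)).grad  # 1 + 1/e
dy_dx2 = f(Dual(math.e, 0.0), Dual(1.0, 1.0)).grad  # e
print(dy_dx1, dy_dx2)
```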

Reverse Mode

In the reverse mode, we first calculate $\frac{\partial y}{\partial v_4}$, then $\frac{\partial y}{\partial v_3}$, then $\frac{\partial y}{\partial v_2}$, then $\frac{\partial y}{\partial v_1}$, and finally $\frac{\partial y}{\partial v_0}$.

At evaluation point $(e, 1)$, we have

| $i$ | $v_i$ | $\frac{\partial y}{\partial v_i}$ |
| --- | --- | --- |
| 0 | $e$ | $1 + \frac{1}{e} (= \frac{\partial y}{\partial v_2} \frac{\partial v_2}{\partial v_0} + \frac{\partial y}{\partial v_3} \frac{\partial v_3}{\partial v_0} = \frac{\partial y}{\partial v_2} + \frac{\partial y}{\partial v_3} \frac{1}{e})$ |
| 1 | $1$ | $e (= \frac{\partial y}{\partial v_2} \frac{\partial v_2}{\partial v_1} = \frac{\partial y}{\partial v_2} v_0)$ |
| 2 | $e$ | $1 (= \frac{\partial y}{\partial v_4} \frac{\partial v_4}{\partial v_2} = \frac{\partial y}{\partial v_4})$ |
| 3 | $1$ | $1 (= \frac{\partial y}{\partial v_4} \frac{\partial v_4}{\partial v_3} = \frac{\partial y}{\partial v_4})$ |
| 4 | $e + 1$ | $1 (= \frac{\partial y}{\partial v_4})$ |

so our Jacobian matrix is

$$J_f(e, 1) = (1 + \frac{1}{e}, e)$$

which also matches our earlier analysis, i.e., $J_f(x_1, x_2) = (x_2 + \frac{1}{x_1}, x_1)$.

In the reverse mode, we walk through the graph only once. In general, each pass through the graph yields one row of $J_f$ (one pass per output), so for $f: R^m \rightarrow R^n$, reverse mode is preferred when $n < m$.
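Written out by hand for this small example, the reverse sweep looks like this; a minimal sketch mirroring the table above:

```python
import math

# Forward pass at (x1, x2) = (e, 1), using the same v_0 ... v_4 as in the note.
x1, x2 = math.e, 1.0
v0, v1 = x1, x2
v2 = v0 * v1
v3 = math.log(v0)
v4 = v2 + v3                  # y

# Reverse pass: accumulate dy/dv_i from the output back to the inputs.
dv4 = 1.0
dv2 = dv4 * 1.0               # v4 = v2 + v3
dv3 = dv4 * 1.0
dv1 = dv2 * v0                # v2 = v0 * v1
dv0 = dv2 * v1 + dv3 / v0     # v0 feeds both v2 and v3 = ln(v0)

print(dv0, dv1)               # J_f(e, 1) = (1 + 1/e, e)
```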

In real machine learning cases, $m$ is typically much bigger than $n$: $m$ counts the inputs (and weights) we feed into the model, while $n$ is the dimension of the prediction, or, during training, of the loss. In most cases $n = 1$, which makes reverse mode much more suitable for machine learning.

Implementation

Based on the discussion in the Solutions section, we know that graph computation with reverse mode is the most suitable approach for machine learning. Now, let's discuss how we can implement it in Python.

The overall idea of the autograd implementation is:

  • When the grad function is called, it immediately returns a closure, which will calculate the gradients when invoked.
  • The implementation of that closure consists of two steps:
    • Step 1: Trace the Python code into a computation graph.
    • Step 2: Run the backward pass to calculate the gradients.

Next, we'll discuss the implementation of the closure in detail.

Trace The Python Code Into Computation Graph

To trace the Python code into a computation graph, we run the user's Python code through the Python interpreter directly. But instead of actually executing the math functions (e.g., matmul), we override them to return graph nodes. Upstream nodes' outputs become downstream nodes' inputs, so these nodes form a graph: our computation graph.

Running the example $f(x_1, x_2) = x_1 x_2 + \ln(x_1)$ this way produces the same computation graph we described earlier.

Note that any control flow in the function is evaluated eagerly during tracing, meaning the computation graph won't contain the control flow at all.
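A minimal tracing sketch, assuming we overload Python's arithmetic operators on a small `Node` class (the class and op names here are my own, not from any particular framework):

```python
import math

class Node:
    """One value in the computation graph, remembering how it was produced."""
    def __init__(self, value, op=None, inputs=()):
        self.value = value      # forward value (computed eagerly)
        self.op = op            # which op produced this node, e.g. 'mul'
        self.inputs = inputs    # upstream Node objects
        self.grad = 0.0         # filled in by the backward pass

    def __mul__(self, other):
        return Node(self.value * other.value, op='mul', inputs=(self, other))

    def __add__(self, other):
        return Node(self.value + other.value, op='add', inputs=(self, other))

def log(x):
    return Node(math.log(x.value), op='log', inputs=(x,))

# Tracing: running the user function on Node objects records the graph.
def f(x1, x2):
    return x1 * x2 + log(x1)

x1, x2 = Node(math.e), Node(1.0)
y = f(x1, x2)
print(y.value, y.op, [n.op for n in y.inputs])  # (e + 1), 'add', ['mul', 'log']
```

Because `f` only ever sees `Node` objects during tracing, every math op it performs is recorded, while plain Python control flow simply runs and leaves no trace in the graph.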

Run The Backward Pass

To run the backward pass, we just need to walk the computation graph in reverse order.

Here, each op's xxx_grad_fn returns that op's local Jacobian; in other words, they are the per-op gradient functions that the chain rule combines.

The returned gradients are then used by the caller to update the weights.
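Continuing the `Node` sketch from the tracing section, a minimal backward pass can walk the graph in reverse topological order and apply the chain rule per op; the `if/elif` branches below play the role of the per-op xxx_grad_fn, with the gradient rules for `add`, `mul`, and `log` hand-written:

```python
def topo_order(y):
    """List every node reachable from y so that inputs come before consumers."""
    order, visited = [], set()
    def visit(node):
        if id(node) not in visited:
            visited.add(id(node))
            for inp in node.inputs:
                visit(inp)
            order.append(node)
    visit(y)
    return order

def backward(y):
    """Fill node.grad = dy/dnode for every node in the graph of y."""
    y.grad = 1.0
    for node in reversed(topo_order(y)):
        if node.op == 'add':          # v = a + b  ->  dv/da = dv/db = 1
            a, b = node.inputs
            a.grad += node.grad
            b.grad += node.grad
        elif node.op == 'mul':        # v = a * b  ->  dv/da = b, dv/db = a
            a, b = node.inputs
            a.grad += node.grad * b.value
            b.grad += node.grad * a.value
        elif node.op == 'log':        # v = ln(a)  ->  dv/da = 1 / a
            (a,) = node.inputs
            a.grad += node.grad / a.value

# Tie the two steps together as the closure described above (sketch only).
def grad(f):
    def grad_f(*values):
        inputs = [Node(v) for v in values]   # step 1: trace f on Node objects
        backward(f(*inputs))                 # step 2: run the backward pass
        return tuple(n.grad for n in inputs)
    return grad_f

print(grad(f)(math.e, 1.0))  # (1 + 1/e, e), matching the reverse-mode table
```

In a training loop, `grad` would be applied to the loss as a function of the weights, giving exactly the $\frac{\partial L}{\partial w_i}$ we set out to compute at the start of this note.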