Fundamentals

This note discusses the fundamentals of machine learning.

What Is The Goal Of A Machine Learning Model?

Let’s assume we have a machine learning model $f$, where $x_i$ are its inputs, $w_i$ are its parameters, and $y$ is its prediction. We have

$$y = f_{model}(x_1, x_2, x_3, \dots, x_m, w_1, w_2, w_3, \dots, w_k)$$

The goal of machine learning is to tune $w_i$ so that the prediction $y$ is close enough to the real label, denoted $\tilde{y}$. To quantify the difference between the two, we use a loss function.

$$L = l(y - \tilde{y})$$

So mathematically, we want to minimize $L$.

How Do We Train A Model?

In order to minimize the loss $L$, we’ll need to figure out how to tune all the parameters in the model, i.e., $w_i$.

There are many ways to tune the parameters.

  • If our model function $f$ is simple enough, e.g., a linear function like $f(x) = w x + b$, we might be able to solve analytically for the values of $w_i$ that minimize the loss $L$. This only works when $f$ is simple enough to admit such a closed-form solution.
  • We can randomly assign initial values to $w_i$, and then randomly adjust each $w_i$ by a small delta to see whether $L$ decreases. If it does, we keep the change and continue updating the parameters until no further improvement is found. This applies to a general model function $f$, but it is too inefficient for models with many parameters.
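Both strategies can be sketched for the linear case $f(x) = w x + b$. The snippet below is only an illustration under stated assumptions: the function names, the toy dataset, the mean squared loss, and the fixed step budget used as a stopping rule are all choices made for this sketch, not something prescribed by this note.

```python
import random

def mean_squared_loss(data, w, b):
    """Average squared error of the line w * x + b over (x, label) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def closed_form_fit(data):
    """Strategy 1: solve for w and b directly (ordinary least squares)."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    var_x = sum((x - mean_x) ** 2 for x, _ in data)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

def random_search_fit(data, steps=20000, delta=0.01, seed=0):
    """Strategy 2: keep a random small change only if the loss decreases.

    Assumption: we stop after a fixed number of steps instead of
    detecting that no further improvement is possible.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    best = mean_squared_loss(data, w, b)
    for _ in range(steps):
        cand_w = w + rng.uniform(-delta, delta)
        cand_b = b + rng.uniform(-delta, delta)
        cand = mean_squared_loss(data, cand_w, cand_b)
        if cand < best:
            w, b, best = cand_w, cand_b, cand
    return w, b, best

# Toy data generated from y = 2x + 1 (hypothetical example values).
data = [(x, 2 * x + 1) for x in range(10)]

print("closed form:", closed_form_fit(data))
print("random search:", random_search_fit(data))
```

The closed-form fit needs a single pass over the data, while the random search evaluates the loss on the full dataset at every step. That cost is what makes it impractical once a model has many parameters.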

However, neither approach works well for neural networks, which are typically too complex for a closed-form solution and have too many parameters for random search. For neural networks, the most widely used way to tune parameters today is backpropagation (backward propagation) combined with gradient descent.

The idea behind it is simple: if we can calculate the derivative of the loss $L$ with respect to each $w_i$ (i.e., the gradient $g_i$), we can use $g_i$ to update $w_i$ by taking a small step in the direction that decreases $L$.

$$w_i = w_i - \alpha \times \frac{\partial L}{\partial w_i}$$

Here $\alpha$ is a small positive number, typically called the learning rate. It keeps each update small so that we don’t overshoot the minimum.
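To see the update rule and the role of $\alpha$ in isolation, here is a minimal sketch on a one-parameter loss. The loss $L(w) = (w - 3)^2$, the function names, and the two learning rates are assumptions chosen purely for illustration.

```python
def loss(w):
    # Hypothetical one-parameter loss with its minimum at w = 3.
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of the loss above with respect to w.
    return 2.0 * (w - 3.0)

def gradient_descent(w, alpha, steps):
    # Repeatedly apply w = w - alpha * dL/dw.
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# A small learning rate moves w steadily toward the minimum.
w_small = gradient_descent(0.0, alpha=0.1, steps=100)

# A learning rate that is too large overshoots further on every step.
w_large = gradient_descent(0.0, alpha=1.1, steps=100)

print("alpha=0.1:", w_small, "loss:", loss(w_small))
print("alpha=1.1:", w_large, "loss:", loss(w_large))
```

With the smaller learning rate, the distance to the minimum shrinks on each step. With the larger one, every update jumps past the minimum and lands further away than before, which is why the learning rate is kept small.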

A Linear Regression Model

Let’s use the above idea to build a linear regression model.

$$y = w x + b$$

$$L = (y - \tilde{y})^2$$

There are two parameters in this model, $w$ and $b$. First, let’s calculate the partial derivatives of $L$ with respect to each of them using the chain rule.

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial w} = 2 (y - \tilde{y}) x$$

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial b} = 2 (y - \tilde{y})$$
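One way to sanity-check hand-derived gradients like these is to compare them against a finite-difference approximation. The sketch below does that for a single example. The helper names, the sample values, and the step size `eps` are assumptions made for illustration.

```python
def predict(x, w, b):
    # The linear model y = w * x + b.
    return w * x + b

def loss(x, label, w, b):
    # Squared error between the prediction and the label.
    return (predict(x, w, b) - label) ** 2

def analytic_grads(x, label, w, b):
    # The two partial derivatives derived above.
    err = predict(x, w, b) - label
    return 2 * err * x, 2 * err

def numeric_grads(x, label, w, b, eps=1e-6):
    # Central differences: nudge one parameter at a time.
    dw = (loss(x, label, w + eps, b) - loss(x, label, w - eps, b)) / (2 * eps)
    db = (loss(x, label, w, b + eps) - loss(x, label, w, b - eps)) / (2 * eps)
    return dw, db

# Arbitrary sample point and parameter values (hypothetical).
x, label, w, b = 3.0, 7.0, 0.5, -1.0

print("analytic:", analytic_grads(x, label, w, b))
print("numeric: ", numeric_grads(x, label, w, b))
```

The two results should agree up to floating-point error, which gives some confidence in the derivation before using it in a training loop.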

Now, we can implement the linear regression model based on that.

import random

learning_rate = 0.0001
epochs = 10

# Define our model.
w, b = 0, 0
def f(x, w, b):
    return x * w + b

# Generating training and eval datasets.
target_w, target_b = 2, 1

training_data = []
training_data_size = 10000

eval_data = []
eval_data_size = 100

for i in range(training_data_size):
    x = random.randrange(0, 100)
    y = f(x, target_w, target_b)
    training_data.append((x, y))

for i in range(eval_data_size):
    x = random.randrange(0, 100)
    y = f(x, target_w, target_b)
    eval_data.append((x, y))

# Train our model.
print("Training model...")
for epoch in range(epochs):
    losses = []
    for x, y in training_data:
        # Forward pass: p is the prediction, y is the label.
        p = f(x, w, b)
        loss = (p - y) ** 2
        losses.append(loss)

        # Backward pass: apply the gradient descent update to w and b.
        w -= learning_rate * 2 * (p - y) * x
        b -= learning_rate * 2 * (p - y)

    print("Epoch {}: loss={:.6f} (w={:.3f}, b={:.3f})".format(
        epoch, sum(losses) * 1.0 / len(losses), w, b
    ))

# Evaluate our model.
print("\nEvaluating model...")
losses = []
for x, y in eval_data:
    # Forward pass only; parameters are not updated during evaluation.
    p = f(x, w, b)
    loss = (p - y) ** 2
    losses.append(loss)

print("Final loss is: {:.6f}".format(sum(losses) * 1.0 / len(losses)))

Running the above code gives output similar to the following (exact values vary between runs because the data is randomly generated):

Training model...
Epoch 0: loss=5.941978 (w=2.011, b=0.402)
Epoch 1: loss=0.074746 (w=2.007, b=0.637)
Epoch 2: loss=0.027469 (w=2.004, b=0.780)
Epoch 3: loss=0.010095 (w=2.002, b=0.867)
Epoch 4: loss=0.003710 (w=2.001, b=0.919)
Epoch 5: loss=0.001363 (w=2.001, b=0.951)
Epoch 6: loss=0.000501 (w=2.001, b=0.970)
Epoch 7: loss=0.000184 (w=2.000, b=0.982)
Epoch 8: loss=0.000068 (w=2.000, b=0.989)
Epoch 9: loss=0.000025 (w=2.000, b=0.993)

Evaluating model...
Final loss is: 0.000014

This training took 10 epochs. An epoch here means one pass over the full training dataset. As you can see, the loss becomes very small after a few epochs of training. The parameters $w$ and $b$ were tuned to $2.0$ and $0.993$, which are very close to the target values, i.e., $2.0$ and $1.0$.

What Is Missing?

Although the above code implements the idea we talked about in this note, the implementation is not practical for a model of any substantial size. The gaps are:

  • No autograd: For large models, we typically want to calculate the gradients (e.g., $\frac{\partial L}{\partial w_i}$) automatically rather than deriving them by hand. We don’t have a solution to that yet.
  • No distributed training: The whole training process above runs on one CPU core on a single machine. Large models typically require more training resources.
  • No GPU/TPU support: Modern ML frameworks leverage hardware accelerators, e.g., GPUs and TPUs, which make training a lot faster.

These are the topics we will cover next.