Fundamentals

This note discusses the fundamentals of machine learning.

What Is The Goal Of A Machine Learning Model?

Let’s assume we have a machine learning model $f$, where $x_i$ are its inputs, $w_i$ are its parameters, and $y$ is its prediction. We have

$$y = f_{model}(x_1, x_2, x_3, \dots, x_m, w_1, w_2, w_3, \dots, w_k)$$

The goal of machine learning is to tune $w_i$ so that the predicted $y$ is close enough to the real label $\tilde{y}$. To quantify the difference, we use a loss function.

$$L = l(y - \tilde{y})$$

So mathematically, we want to minimize $L$.

How Do We Train A Model?

In order to minimize the loss $L$, we’ll need to figure out how to tune all the parameters in the model, i.e., $w_i$.

There are many ways to tune the parameters.

  • If our model function $f$ is simple enough, e.g., a linear function like $f(x) = wx + b$, we might be able to calculate the theoretical values of $w_i$ that minimize the loss $L$ analytically. This only works if $f$ is simple enough.
  • We can randomly assign initial values to $w_i$, and then adjust each $w_i$ by adding a small delta to see if $L$ gets smaller. If it does, we keep the change, and we continue updating the parameters until we can no longer make an improvement. This works for a general model function $f$, but the parameter tuning is not efficient.
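The second approach can be sketched in a few lines of Python. The toy dataset, the target function $y = 2x + 1$, the delta size, and the iteration count below are all assumptions made for illustration:

```python
import random

# Toy dataset generated from an assumed target function y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(10)]

def loss(w, b):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

# Randomly assign initial values.
w = random.uniform(-1, 1)
b = random.uniform(-1, 1)
delta = 0.01

for _ in range(5000):
    # Try nudging each parameter by a small delta in both directions;
    # keep a change only if the loss goes down.
    for dw, db in [(delta, 0), (-delta, 0), (0, delta), (0, -delta)]:
        if loss(w + dw, b + db) < loss(w, b):
            w += dw
            b += db

print(w, b)  # w and b should end up close to the target values 2 and 1
```

Note how many loss evaluations this takes for just two parameters; with millions of parameters, trial-and-error like this becomes hopeless.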

However, these solutions do not work for neural networks: neural networks are typically complicated enough that they require a more efficient parameter tuning strategy. In the world of neural networks, the most effective way to tune parameters today is backward propagation with gradient descent.

The idea behind it is simple: if we can calculate the derivative of the loss $L$ with respect to each $w_i$ (i.e., the gradient $g_i$), we can use $g_i$ to tune $w_i$ directly.

$$w_i = w_i - \alpha \times \frac{\partial L}{\partial w_i}$$

Here $\alpha$ is a small number close to $0$, typically called the learning rate. We use it to make sure we don’t over-tune the parameters in a single step.
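As a toy illustration of this update rule on a single parameter (the function $L = (w - 3)^2$ and the numbers below are arbitrary choices for this sketch):

```python
# Minimize L(w) = (w - 3)^2; its gradient is dL/dw = 2 * (w - 3).
w = 0.0
alpha = 0.1  # learning rate

for step in range(100):
    grad = 2 * (w - 3)     # gradient of L with respect to w
    w = w - alpha * grad   # the gradient descent update rule

print(round(w, 6))  # 3.0, the minimizer of L
```

Each step moves $w$ a fraction $\alpha$ of the way along the negative gradient, so $w$ approaches the minimum without overshooting it.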

A Linear Regression Model

Let’s use the above idea to build a linear regression model.

$$y = wx + b$$
$$L = (y - \tilde{y})^2$$

There are two parameters in this model, $w$ and $b$. First, let’s calculate the partial derivatives of $L$ with respect to each of them.

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial w} = 2 (y - \tilde{y}) x$$
$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial b} = 2 (y - \tilde{y})$$
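These derivatives can be sanity-checked numerically with central finite differences. The sample point below is an arbitrary choice, and $\tilde{y}$ is written as `y_true`:

```python
def loss(w, b, x, y_true):
    return (w * x + b - y_true) ** 2

# Pick an arbitrary point at which to check the derivatives.
w, b, x, y_true = 0.5, 0.2, 3.0, 7.0
y = w * x + b

# Analytic gradients from the formulas above.
dL_dw = 2 * (y - y_true) * x
dL_db = 2 * (y - y_true)

# Numerical gradients via central differences.
eps = 1e-6
num_dw = (loss(w + eps, b, x, y_true) - loss(w - eps, b, x, y_true)) / (2 * eps)
num_db = (loss(w, b + eps, x, y_true) - loss(w, b - eps, x, y_true)) / (2 * eps)

print(abs(dL_dw - num_dw) < 1e-4, abs(dL_db - num_db) < 1e-4)  # True True
```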

Now, we can implement the linear regression model based on that.

import random

learning_rate = 0.0001
epochs = 10

# Define our model.
w, b = 0, 0
def f(x, w, b):
    return x * w + b

# Generate training and eval datasets.
target_w, target_b = 2, 1

training_data = []
training_data_size = 10000

eval_data = []
eval_data_size = 100

for i in range(training_data_size):
    x = random.randrange(0, 100)
    y = f(x, target_w, target_b)
    training_data.append((x, y))

for i in range(eval_data_size):
    x = random.randrange(0, 100)
    y = f(x, target_w, target_b)
    eval_data.append((x, y))

# Train our model.
print("Training model...")
for epoch in range(epochs):
    losses = []
    for x, y in training_data:
        # Run the forward pass.
        p = f(x, w, b)
        loss = (p - y) ** 2
        losses.append(loss)

        # Run the backward pass.
        w -= learning_rate * 2 * (p - y) * x
        b -= learning_rate * 2 * (p - y)

    print("Epoch {}: loss={:.6f} (w={:.3f}, b={:.3f})".format(
        epoch, sum(losses) * 1.0 / len(losses), w, b
    ))

# Evaluate our model.
print("\nEvaluating model...")
losses = []
for x, y in eval_data:
    # Run the forward pass.
    p = f(x, w, b)
    loss = (p - y) ** 2
    losses.append(loss)

print("Final loss is: {:.6f}".format(sum(losses) * 1.0 / len(losses)))

Running the above code will give us:

Training model...
Epoch 0: loss=5.941978 (w=2.011, b=0.402)
Epoch 1: loss=0.074746 (w=2.007, b=0.637)
Epoch 2: loss=0.027469 (w=2.004, b=0.780)
Epoch 3: loss=0.010095 (w=2.002, b=0.867)
Epoch 4: loss=0.003710 (w=2.001, b=0.919)
Epoch 5: loss=0.001363 (w=2.001, b=0.951)
Epoch 6: loss=0.000501 (w=2.001, b=0.970)
Epoch 7: loss=0.000184 (w=2.000, b=0.982)
Epoch 8: loss=0.000068 (w=2.000, b=0.989)
Epoch 9: loss=0.000025 (w=2.000, b=0.993)

Evaluating model...
Final loss is: 0.000014

This training took 10 epochs. An epoch here means using the full dataset to train the model once. As you can see, the loss becomes really small after a few epochs of training. The parameters $w$ and $b$ got tuned to $2.0$ and $0.993$, which is very close to the target values, i.e., $2.0$ and $1.0$.

What Is Missing?

Although the above code implements the idea we talked about in this note, the implementation is not practical for a decently sized model. The gaps are:

  • No autograd: For large models, we typically want to calculate the gradients (e.g., $\frac{\partial L}{\partial w_i}$) automatically. We don’t have a solution for that yet.
  • No distributed training: The whole training process above runs on one CPU core on a single machine. Large models typically require more training resources.
  • No GPU/TPU support: Modern ML frameworks leverage hardware accelerators, e.g., GPU/TPU, which makes training a lot faster.

These are going to be our topics next.