Neural Networks From Zero

Part 4 of the AI/LLM mastery series — the math, made clear. We build a neural network from a single neuron up, with real numbers and the actual equations: the weighted sum, activation functions, the forward pass, loss, gradient descent, the update rule, backpropagation, and the training loop — then scale it from six dials to the trillions inside a frontier model. No prior maths needed; every symbol defined.

AI/LLM Mastery · Part 4 of 20 — we open the black box. From a single neuron to a network that teaches itself, built from zero with real numbers and the actual equations — so that when the Transformer arrives, there is no magic left in it.

Opening the black box

For three parts we have leaned on a phrase and quietly hoped you would not ask too hard about it: the neural network, that “giant bank of little dials” which learns. In Part 1 it learned from data; in Part 3 it nudged word-vectors around a map of meaning. But what is a dial, really? What does “nudge” mean in actual arithmetic? Today we stop hand-waving and build the thing from the ground up.

This part is more mathematical than the others, on purpose — you asked to really understand it, and you cannot understand a learning machine without a little of its maths. But the deal from Part 1 still holds: every symbol gets defined the moment it appears, and we will work everything with concrete numbers, not abstractions. There is no calculus you need to bring with you. By the end, “the model trained” will mean something exact in your head, and you will even have read the few lines of code that do it.

The neuron: multiply, add, squish

Everything starts with one tiny unit, the neuron (the idea dates back to McCulloch and Pitts in 1943, who first modelled a brain cell as simple maths). A neuron does three things, in order: it takes some input numbers, combines them, and produces one output number. That is the whole job.

Walk through it with the numbers in the animation. Two inputs arrive: x1 = 0.5 and x2 = 0.9 — think of them as two pieces of evidence. Each input has a weight — a number saying how much that input matters: w1 = 0.8 (count input 1 strongly), w2 = -0.4 (input 2 actually argues the other way; a negative weight subtracts). The neuron multiplies each input by its weight, adds them together, then adds a bias b = 0.1 — an always-on nudge that shifts the result up or down. Finally it passes that total through an activation function, which we will meet in a moment. The weights and biases are the “dials”: the numbers the network will adjust as it learns.

The weighted sum, with real numbers

Let us do the combining step with full arithmetic, because it is the single most-repeated calculation in all of AI. The combination is called the weighted sum, written z:

the weighted sum
# z = sum of (each input times its weight), plus the bias
z = w1*x1 + w2*x2 + b

Now substitute the numbers: z = 0.8×0.5 + (-0.4)×0.9 + 0.1. Work out each piece: 0.8×0.5 = 0.40 (input 1 pushes the total up), -0.4×0.9 = -0.36 (input 2 pulls it down, because its weight is negative), and the bias adds 0.10. Add them: z = 0.40 - 0.36 + 0.10 = 0.14. That is it — a multiply-and-add. A real network does this same operation billions of times per forward pass, which is exactly why the chips that train AI are, at heart, giant multiply-and-add machines.
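The same arithmetic can be checked in a few lines of Python. This is a minimal sketch using the numbers from the walkthrough; the function name `weighted_sum` is just an illustrative choice.

```python
# Values taken from the walkthrough above.
inputs = [0.5, 0.9]      # x1, x2
weights = [0.8, -0.4]    # w1, w2
bias = 0.1               # b

def weighted_sum(inputs, weights, bias):
    """Multiply each input by its weight, add the products, then add the bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

z = weighted_sum(inputs, weights, bias)
print(round(z, 2))  # 0.14
```

The `round` is only there because computers store decimals like 0.36 with a tiny rounding error; the calculation itself is the same multiply-and-add.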

The activation: the bend that makes it powerful

Why squish the result at all? Because without it, a network would be powerless, and this is a subtle, important point. If every neuron only multiplied and added, then stacking layers of them would still only ever produce a straight line, no matter how many you stacked (feeding one straight-line function into another just gives a third straight-line function). It could never bend to fit a curve, a letter shape, or the structure of language.
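To see the collapse concretely, here is a minimal sketch with two one-input "layers" that only multiply and add. The weights and biases (2.0, 1.0, 3.0, -4.0) are made-up values chosen for illustration, not numbers from the article.

```python
def layer_one(x):
    return 2.0 * x + 1.0      # hypothetical weight 2.0, bias 1.0

def layer_two(h):
    return 3.0 * h - 4.0      # hypothetical weight 3.0, bias -4.0

def stacked(x):
    """Two multiply-and-add layers, one after the other, no activation."""
    return layer_two(layer_one(x))

def single(x):
    # 3 * (2x + 1) - 4 simplifies to 6x - 1: one weight, one bias.
    return 6.0 * x - 1.0

for x in [-2.0, 0.0, 0.5, 3.0]:
    print(x, stacked(x), single(x))   # the last two columns always match
```

However many such layers you chain, the algebra always simplifies back to a single weight and a single bias, which is why the extra layers buy nothing without a bend in between.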

The fix is the activation function: a small non-linear bend applied to z. One of the most common is ReLU (Rectified Linear Unit): f(z) = max(0, z), which keeps the number if it is positive and otherwise makes it zero. So our z = 0.14 stays 0.14; a negative sum would become 0. Another classic is the sigmoid, an S-shaped curve that squashes any number into the range 0 to 1 (handy when you want an output that behaves like a probability). That one little kink is what lets a big network bend and fold to fit almost any pattern. It is the difference between dumb arithmetic and a universal learner.
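Both activations fit in a line each. The sketch below assumes the standard sigmoid formula 1 / (1 + e^(-z)), which the text describes only by its shape.

```python
import math

def relu(z):
    """Keep positive numbers, turn everything else into zero."""
    return max(0.0, z)

def sigmoid(z):
    """S-shaped curve that squashes any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-z))

print(relu(0.14))     # 0.14
print(relu(-0.36))    # 0.0
print(sigmoid(0.0))   # 0.5
```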

Layers and the forward pass

One neuron is not much. The power comes from wiring many together into layers: an input layer, one or more hidden layers in the middle, and an output layer. Every connection between neurons is its own weight — its own dial.

Pushing numbers through, left to right, is called the forward pass. Feed in 0.5 and 0.9 (for a language model these would be the embedding numbers from Part 3). Each hidden neuron does its own multiply-add-squish and produces a number — here 0.31 and 0.62. Those become the inputs to the output neuron, which does one final weighted sum and activation to produce the network’s prediction: 0.73. In code the whole neuron is almost laughably short:

a neuron, in Python
def neuron(inputs, weights, bias):
    z = sum(x*w for x, w in zip(inputs, weights)) + bias
    return max(0, z)        # ReLU activation

# a "forward pass" just calls neurons, layer by layer.
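To make "layer by layer" concrete, here is a minimal sketch of a forward pass through a network with two inputs, two hidden neurons and one output. Every weight and bias below is a made-up value for illustration, so the numbers it produces will not match the 0.31, 0.62 and 0.73 quoted above. The helper names `layer` and `forward` are also illustrative.

```python
def neuron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, z)  # ReLU activation

def layer(inputs, weight_rows, biases):
    """One layer: every neuron sees the same inputs but has its own dials."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical dial settings, chosen only for illustration.
hidden_weights = [[0.8, -0.4], [0.2, 0.6]]   # one row per hidden neuron
hidden_biases = [0.1, 0.0]
output_weights = [[0.5, 0.9]]                # one output neuron
output_biases = [0.05]

def forward(inputs):
    hidden = layer(inputs, hidden_weights, hidden_biases)      # input -> hidden
    return layer(hidden, output_weights, output_biases)[0]     # hidden -> output

prediction = forward([0.5, 0.9])
print(round(prediction, 3))  # 0.696
```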

“Running” any neural network — from this toy to GPT — is just a forward pass like this. The prediction is whatever falls out the right-hand side. The only question left is the big one: where do good values for all those weights come from? Right now ours are random, so the prediction is meaningless. Training is the answer, and it has three moving parts: a way to measure error, a way to reduce it, and a way to spread the fix across every dial.

Loss: one number for how wrong

First, measure how wrong the prediction was. The network said 0.73; suppose the correct answer for this training example (its label) was 1.0. We need to turn that gap into a single number to minimise — the loss.

The simplest useful loss is the squared error: loss = (target - prediction)². Here that is (1.0 - 0.73)² = 0.27² = 0.0729, or about 0.073. Why square it? Two reasons, both practical. Squaring means the result is never negative, so guessing too low is penalised the same as guessing too high. And it punishes big mistakes far more than small ones (an error of 0.5 costs 0.25; an error of 1.0 costs a full 1.0), which pushes the network to fix its worst blunders first. Lower loss is better; zero is perfect. Training is just the relentless hunt to make this one number small.
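As a quick check of those numbers, here is the squared error as a short Python function. The last call, a guess of 1.5, is an added illustration of an overshoot rather than a case from the text.

```python
def squared_error(prediction, target):
    """One number for how wrong: the gap between target and guess, squared."""
    return (target - prediction) ** 2

print(round(squared_error(0.73, 1.0), 4))  # 0.0729
print(squared_error(0.5, 1.0))             # 0.25  (error of 0.5)
print(squared_error(0.0, 1.0))             # 1.0   (error of 1.0)
print(squared_error(1.5, 1.0))             # 0.25  (too high costs the same as too low)
```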

Gradient descent: roll downhill

Now, how do we shrink the loss? Here is the central idea of the whole field, and it is wonderfully visual.

Imagine plotting the loss for every possible value of one weight. You get a landscape — usually a valley, with bad settings high up the sides and the best setting at the bottom. The network starts at some random spot, up a slope. At that spot we ask one question: which way is downhill? That direction is the gradient — the slope of the loss curve. (Strictly, the gradient points uphill, toward higher loss, so we step the opposite way; that is why the method is called gradient descent.) Take one small step downhill, then re-check the slope and step again. Each step lowers the loss a little, and the “ball” settles toward the valley floor — the weight setting with the least error.

The one idea to remember: training = roll downhill on the loss landscape. The gradient tells you which way is down; you take small steps until the loss stops falling.
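Rolling downhill can be sketched on a made-up landscape. The bowl-shaped loss below, with its lowest point at w = 3, is purely an assumption for illustration; a real network's landscape is far bumpier. The step formula used inside the loop is the update rule spelled out in the next section, and `step_size` is what that section calls the learning rate.

```python
def loss(w):
    return (w - 3.0) ** 2      # hypothetical valley, floor at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)     # slope of that valley at w (positive = uphill to the right)

w = 0.0                        # start somewhere up the slope
step_size = 0.1
history = [loss(w)]

for _ in range(50):
    w = w - step_size * gradient(w)   # step against the slope, i.e. downhill
    history.append(loss(w))

print(round(w, 3))             # 3.0
print(history[0], history[-1] < 1e-6)
```

Each pass re-checks the slope at the new position, so the steps shrink naturally as the ball nears the floor.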

The update rule: nudge every dial

“Take a small step downhill” is a precise equation, not a vibe. It is the update rule, and it is applied to every weight.

the update rule
# new weight = old weight  -  learning_rate * gradient
w_new = w - learning_rate * gradient

Take a single weight w = 0.80. Suppose the gradient for it is +0.6 — meaning “if w increases, the loss increases,” so we should move w down. The learning rate (often written with the Greek letter eta) is a small number, say 0.1, that sets the step size. Plug in: w_new = 0.80 - 0.1×0.6 = 0.80 - 0.06 = 0.74. The dial moved a deliberate, small amount in the direction that lowers loss. The learning rate matters a lot in practice: too large and you leap clean over the valley and never settle; too small and training takes forever. Apply this rule to every weight and bias in the network at once, and the whole thing inches toward better.
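Here is the update rule as a small Python function applied to a list of dials at once. The first weight and gradient are the ones from the worked example; the second pair (-0.40 with gradient -0.5) is a hypothetical addition to show a dial being nudged upward.

```python
def update(weights, gradients, learning_rate):
    """Move every weight a small step against its own gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

weights = [0.80, -0.40]     # second weight is a made-up example
gradients = [0.6, -0.5]     # positive: move down; negative: move up
new_weights = update(weights, gradients, 0.1)

print([round(w, 2) for w in new_weights])  # [0.74, -0.35]
```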

Backpropagation: spreading the blame

There is one gap left. The update rule needs the gradient for every weight, including ones buried deep in the hidden layers, far from the loss. Computing those efficiently is the trick that makes the whole thing possible — backpropagation.

The forward pass ran left to right and gave us the loss at the output. Backprop runs the error backwards — output → hidden → input — and at each connection asks: how much did you contribute to this mistake? A weight that strongly pushed the output the wrong way gets a large gradient (“move me a lot”); one that barely mattered gets a tiny one. Mathematically this is just the chain rule from calculus, applied layer by layer — but you do not need the calculus to hold the picture: blame flows backward, and each dial learns its share. Backprop, popularised for neural networks by Rumelhart, Hinton and Williams in 1986, is what makes training billions of weights actually feasible — it gets every gradient in a single backward sweep.
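For the curious, the chain rule can be seen at work on the smallest possible case: the single neuron from earlier, treated as if it were the whole network. This is a sketch under assumptions: it reuses that neuron's inputs, weights and bias, assumes a target of 1.0, and uses ReLU with squared error. The final lines compare the backprop answer with a brute-force estimate (nudge the weight slightly and watch the loss).

```python
def forward(w1, w2, b, x1, x2):
    z = w1 * x1 + w2 * x2 + b          # weighted sum
    return z, max(0.0, z)              # (z, ReLU output)

def loss_fn(w1, w2, b, x1, x2, target):
    _, a = forward(w1, w2, b, x1, x2)
    return (target - a) ** 2

def backward(w1, w2, b, x1, x2, target):
    z, a = forward(w1, w2, b, x1, x2)
    dloss_da = -2.0 * (target - a)     # how the loss changes with the output
    da_dz = 1.0 if z > 0 else 0.0      # slope of ReLU at z
    dloss_dz = dloss_da * da_dz        # chain rule: multiply the links
    # z changes with w1 at rate x1, with w2 at rate x2, with b at rate 1.
    return dloss_dz * x1, dloss_dz * x2, dloss_dz

x1, x2, target = 0.5, 0.9, 1.0         # target is an assumed label
w1, w2, b = 0.8, -0.4, 0.1
grad_w1, grad_w2, grad_b = backward(w1, w2, b, x1, x2, target)
print(round(grad_w1, 3), round(grad_w2, 3), round(grad_b, 3))  # -0.86 -1.548 -1.72

# Brute-force check for w1: nudge it both ways and measure the slope.
eps = 1e-6
numeric_w1 = (loss_fn(w1 + eps, w2, b, x1, x2, target)
              - loss_fn(w1 - eps, w2, b, x1, x2, target)) / (2 * eps)
print(abs(numeric_w1 - grad_w1) < 1e-5)  # True
```

All three gradients are negative, meaning each dial should move up to raise the output toward the target. In a deeper network the same multiplication of links simply continues backward through each layer.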

The training loop

Put the pieces together and you have the loop that trains virtually every neural network in use today. It is just four steps, repeated.

the entire training loop
for example, target in training_data:   # millions of (input, label) pairs, many times over
    prediction = forward(example)        # 1. FORWARD: make a guess
    loss       = squared_error(prediction, target)  # 2. LOSS: how wrong?
    gradients  = backward(loss)          # 3. BACKWARD: blame each dial (backprop)
    update(weights, gradients, lr)       # 4. UPDATE: nudge every dial downhill

Forward, loss, backward, update — then grab the next example and do it again, over millions of examples, many times through the dataset. The loss curve slides down and flattens as the network gets good, and its random dials slowly become a setting that actually solves the task. There is nothing more to it than this loop.
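The loop can be run end to end on a deliberately tiny problem. The sketch below is an illustration under assumptions, not the setup from the article: a single neuron with one input and no activation, a made-up dataset following the hidden rule target = 2x + 1, dials that start at zero rather than random values (so the run is repeatable), and a learning rate of 0.1.

```python
# Hypothetical training data: the hidden rule is target = 2 * x + 1.
training_data = [(x, 2.0 * x + 1.0) for x in [0.0, 0.25, 0.5, 0.75, 1.0]]

def average_loss(w, b):
    """Mean squared error over the whole dataset for the given dials."""
    return sum((t - (w * x + b)) ** 2 for x, t in training_data) / len(training_data)

w, b = 0.0, 0.0     # starting dials
lr = 0.1            # learning rate
initial_loss = average_loss(w, b)

for epoch in range(500):                     # many times through the dataset
    for x, target in training_data:
        prediction = w * x + b               # 1. FORWARD
        error = target - prediction
        loss = error ** 2                    # 2. LOSS
        grad_w = -2.0 * error * x            # 3. BACKWARD (chain rule)
        grad_b = -2.0 * error
        w -= lr * grad_w                     # 4. UPDATE
        b -= lr * grad_b

final_loss = average_loss(w, b)
print(round(w, 3), round(b, 3))              # 2.0 1.0
print(initial_loss, final_loss < 1e-6)
```

Nobody told the neuron the rule; it recovered the weight 2 and bias 1 purely by repeating the four steps.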

From six dials to a trillion

And here is the payoff for sitting through the maths: you now understand the engine of every modern AI, all the way up to the largest models in the world.

Our toy network has about six dials. Stack more neurons and layers and you reach thousands, then millions. GPT-3 had 175 billion weights; today's frontier models are reported to reach into the trillions. Every one of them belongs to the same kind of neuron you met (multiply, add, squish), trained by the same forward/loss/backward/update loop, just on vast hardware for months. Scale changes the bill, not the idea.
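Counting dials is simple arithmetic: every connection between neighbouring layers is one weight, and every neuron after the input layer has one bias. The sketch below assumes fully connected layers; the layout `[2, 2, 1]` is the toy network as the forward pass described it, while `[784, 128, 10]` is a hypothetical larger layout included only to show how fast the count grows.

```python
def count_dials(layer_sizes):
    """Return (weights, biases) for a fully connected network."""
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])       # input layer has no biases
    return weights, biases

print(count_dials([2, 2, 1]))        # (6, 3)
print(count_dials([784, 128, 10]))   # (101632, 138)
```

The toy network's six connection weights are the "six dials"; its three biases are adjustable too, which is why the count is only approximate.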

But raw scale was not the breakthrough. Piling up plain layers does not, by itself, make a machine understand language — you need a clever arrangement of these neurons, one that lets the network decide which words should pay attention to which. That arrangement is the Transformer, and its key idea is called attention. Now that the neuron and the training loop hold no mystery for you, Part 5 begins building the architecture that actually made modern AI possible — and you will be able to see it for what it is: not magic, just a very clever wiring of the dials you now understand.
