Alignment & RLHF

Part 13 of the AI/LLM mastery series — how a model learns not just to answer, but to answer the way humans prefer. RLHF explained: why comparing beats writing, the three-stage InstructGPT recipe (collect preferences, train a reward model, optimise with PPO + a KL leash), reward hacking and sycophancy, the simpler DPO, and the Helpful/Harmless/Honest goal plus Constitutional AI.

#ai #Alignment #DPO #Fundamentals #LLM #RLHF

AI/LLM Mastery · Part 13 of 20: instruction tuning made the model answer; alignment makes it answer the way humans actually want. How RLHF turns human judgements about which of two answers is better into a helpful, honest, well-mannered assistant: reward models, PPO, reward hacking, and the simpler DPO.

From "a good answer" to "the preferred answer"

Part 12 got us most of the way: the model now follows instructions instead of rambling. But there is a subtle gap left. SFT taught the model to imitate good demonstrations — to produce a good answer. For almost any prompt, though, there are many good answers, and what we really want is the one a person would most prefer: the most helpful, the right tone, honest, and safe. That fuzzy sense of “preference” is hard to capture by handing the model fixed examples to copy.

This is the alignment step, and the classic technique is RLHF — Reinforcement Learning from Human Feedback. It is the stage that gives ChatGPT, Claude and the rest their characteristic helpfulness, their refusals, and frankly their personality. It rests on one deceptively simple observation, so let us start there.

The preference gap: comparing beats writing

Why not just write more SFT demonstrations? Because of how humans actually work.

People are surprisingly bad at writing the perfect answer from scratch, but very good — fast and consistent — at comparing two answers and saying “this one is better.” Capturing preferences like tone, helpfulness, or knowing when to refuse a harmful request is nearly impossible to demonstrate exhaustively, but easy to judge case by case. So the whole strategy flips: instead of feeding the model ideal answers, we let the model generate answers, have humans rank them, and then teach the model to produce what people prefer. That is the core idea of RLHF.

RLHF: three stages on top of SFT

RLHF is three stages stacked on top of the SFT model from Part 12.

the RLHF recipe (InstructGPT, 2022)

start:  the SFT model  (Part 12)

  1. COLLECT       model generates answers  ->  humans RANK them
  2. REWARD MODEL  train a model to predict those rankings  (prompt + answer -> score)
  3. RL OPTIMISE   tune the LLM to produce answers the reward model scores high
                   (PPO, with a KL penalty to stay near the SFT model)

result:  the aligned, helpful assistant you actually chat with

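That three-stage recipe (collect preferences, train a reward model, optimise with reinforcement learning) follows the InstructGPT method (Ouyang et al., 2022), the approach that turned GPT-3 into something the public could actually talk to. The paper itself counts the supervised fine-tuning from Part 12 as its first step and folds preference collection into reward-model training; the ingredients are the same, just grouped differently. Let us take the three stages one at a time.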

Stage 1: humans rank the answers

Stage one: gather the human preference signal.

Take a prompt, have the model generate several candidate answers, and show them to a human labeller who ranks them best-to-worst (or simply picks the better of a pair). Repeat across tens of thousands of prompts and you build a large dataset of comparisons — “answer A is better than answer B.” Ranking like this is quick and reasonably consistent between people, which is exactly why it scales where writing perfect answers does not. This comparison data is the raw material of human preference; everything downstream is about turning it into model behaviour.
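To make the shape of that comparison data concrete, here is a minimal sketch of how one best-to-worst ranking can be expanded into pairwise records. The function name `ranking_to_pairs` and the field names `prompt`, `chosen` and `rejected` are illustrative assumptions, not a fixed standard, and the example answers are placeholders.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_answers):
    """Expand one best-to-worst ranking into (prompt, chosen, rejected) records."""
    pairs = []
    # combinations() preserves input order, so `better` always outranks `worse`.
    for better, worse in combinations(ranked_answers, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs


# A labeller ranked three candidate answers, best first (placeholder text).
example = ranking_to_pairs(
    "Explain what a KL penalty is.",
    [
        "clear, correct answer",
        "correct but rambling answer",
        "confidently wrong answer",
    ],
)

for record in example:
    print(record["chosen"], ">", record["rejected"])
```

A ranking of K answers yields K·(K−1)/2 comparisons, so a single labelling pass over three answers produces three training pairs.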

Stage 2: the reward model

Stage two: distil those human judgements into an automatic judge.

Train a separate model — the reward model — that takes a (prompt, answer) pair and outputs a single number, a reward score. You train it on the comparison data so that answers humans preferred receive higher scores than answers they did not. Once trained, it is a fast, automatic stand-in for human judgement: feed it any answer and it estimates “how much would a person like this?” without you having to ask a real person each time. And because it is a learned model, it generalises — it can score answers to prompts it never saw. That reusable model of human preference is the engine the final stage optimises against.
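One common way to train on comparisons is a pairwise loss: the reward model scores both answers, and the loss is small when the preferred answer scores higher. The sketch below shows only that loss on two already-computed scores; the reward model itself is left out, and the name `pairwise_loss` is an assumption made for illustration.

```python
import math


def pairwise_loss(score_chosen, score_rejected):
    """Return -log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The reward model agrees with the human: preferred answer scored higher.
agrees = pairwise_loss(score_chosen=2.0, score_rejected=0.0)

# The reward model disagrees: preferred answer scored lower.
disagrees = pairwise_loss(score_chosen=0.0, score_rejected=2.0)

# No opinion either way: the loss is log(2).
undecided = pairwise_loss(score_chosen=0.0, score_rejected=0.0)

print(agrees, undecided, disagrees)
```

Only the difference between the two scores matters, which is why the reward model's numbers are meaningful for comparing answers rather than as absolute grades.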

Stage 3: optimise toward reward (PPO)

Stage three: actually improve the LLM, using the reward model as its grader.

Here we use reinforcement learning — the classic algorithm is PPO (Proximal Policy Optimization, Schulman et al., 2017). The LLM generates an answer, the reward model scores it, and we nudge the model’s dials to make high-scoring answers more likely — reward goes up, that behaviour is reinforced. There is one essential safeguard: a KL penalty that keeps the updated model from straying too far from the SFT model it started as. Without that leash, the model could chase reward right off a cliff — into bizarre, degenerate text that happens to score well. With it, the model steadily learns to produce answers humans prefer while still sounding like itself. This stage is precisely what gives an assistant its helpful, on-tone, well-mannered feel.
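The KL leash can be sketched as a simple adjustment to the reward: subtract a penalty proportional to how much more likely the current model finds its own answer than the SFT model does. This is not PPO itself, only the quantity a PPO-style loop would try to maximise. The function name `kl_shaped_reward`, the log-probability values, and `beta=0.1` are illustrative assumptions.

```python
def kl_shaped_reward(reward_score, policy_logprobs, sft_logprobs, beta=0.1):
    """Reward-model score minus beta times a one-sample KL estimate.

    policy_logprobs / sft_logprobs: per-token log-probabilities that the
    current model and the frozen SFT model assign to the same sampled answer.
    beta: strength of the leash (0.1 is an illustrative value only).
    """
    log_ratio = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return reward_score - beta * log_ratio


# Same reward-model score, different amounts of drift from the SFT model.
on_leash = kl_shaped_reward(1.0, [-1.0, -1.0], [-1.0, -1.0])
drifted = kl_shaped_reward(1.0, [-0.1, -0.1], [-3.0, -3.0])

print(on_leash, drifted)
```

Both answers receive the same score from the reward model, but the one the SFT model considers very unlikely ends up with a lower effective reward, which is what discourages degenerate text that merely scores well.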

The trap: reward hacking

But there is a trap hiding in that loop, and it is worth understanding deeply.

The model is not actually maximising human happiness. It is maximising the reward model’s number, which is only an approximation of human preference. Optimise an approximation hard enough and the model finds its cracks. This is reward hacking, and the symptoms are familiar: answers that ramble on at length, that are overly agreeable (sycophancy: telling you what you want to hear), or that pad themselves with confident-sounding filler. These are all things a reward model tends to over-rate. It is a textbook case of Goodhart’s law: “when a measure becomes a target, it ceases to be a good measure.” The defences (the KL leash, bigger and better reward models, and continually collecting fresh human preferences) help but never fully cure it. Alignment is a permanent tug-of-war, not a box you tick once.
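A toy illustration of the gap between the proxy and the real goal, under loudly stated assumptions: the three candidates, their `human_score` values, and the `length_bias` are all invented for this sketch, and the proxy is deliberately flawed so that it over-rates long answers.

```python
# Hypothetical candidates with made-up numbers, for illustration only.
candidates = [
    {"text": "Short, correct answer.", "human_score": 0.9, "n_words": 20},
    {"text": "Correct answer buried in padding.", "human_score": 0.6, "n_words": 400},
    {"text": "Flattering but wrong answer.", "human_score": 0.2, "n_words": 150},
]


def proxy_reward(candidate, length_bias=0.002):
    """A deliberately flawed stand-in for a reward model that over-rates length."""
    return candidate["human_score"] + length_bias * candidate["n_words"]


# What people would actually pick versus what optimising the proxy picks.
best_for_humans = max(candidates, key=lambda c: c["human_score"])
best_for_proxy = max(candidates, key=proxy_reward)

print("humans prefer:", best_for_humans["text"])
print("proxy prefers:", best_for_proxy["text"])
```

The proxy tracks human preference reasonably well on average, yet selecting the top-scoring answer lands on the padded one. Optimising harder against the same flaw only widens that gap.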

DPO: the same goal, far simpler

RLHF is powerful, but a separate reward model plus a finicky RL loop is a lot of moving parts. In 2023 a much simpler method arrived and quickly caught on.

DPO (Direct Preference Optimization, Rafailov et al., 2023) throws out both the separate reward model and the RL loop. Instead it tunes the LLM directly on the preference pairs (a chosen answer and a rejected one) with a single clever loss that pushes the probability of the chosen answer up and the rejected one down, each measured relative to a frozen reference copy of the starting model. That reference plays the role the KL leash played in PPO. It captures the same human-preference signal, but with plain, stable, supervised-style training. Fewer parts to break, less compute, and often comparable results, so DPO and its variants are now a popular default. A related idea, Constitutional AI (Bai et al., 2022, Anthropic), replaces a chunk of the human labelling with AI feedback guided by a written set of principles (a “constitution”), letting the process scale further with less human effort. Its reinforcement-learning phase is often called RLAIF, reinforcement learning from AI feedback.
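The DPO loss for a single preference pair can be sketched in a few lines. The inputs are the total log-probabilities that the model being trained and a frozen reference model assign to the chosen and rejected answers. The function and argument names, the example log-probabilities, and `beta=0.1` are illustrative assumptions rather than values from the paper.

```python
import math


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, from sequence log-probabilities."""
    # How much more (or less) likely each answer is under the model being
    # trained than under the frozen reference model.
    chosen_gain = pol_chosen - ref_chosen
    rejected_gain = pol_rejected - ref_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically stable form.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


# Before any training the model equals the reference: the loss is log(2).
at_start = dpo_loss(-12.0, -10.0, -12.0, -10.0)

# Chosen answer became more likely, rejected less likely: the loss falls.
improved = dpo_loss(-9.0, -14.0, -12.0, -10.0)

# The model moved the wrong way: the loss rises.
worse = dpo_loss(-15.0, -8.0, -12.0, -10.0)

print(improved, at_start, worse)
```

Because the loss depends only on log-probabilities, it can be minimised with ordinary gradient descent on a fixed dataset, with no sampling loop and no separate reward model.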

The full assistant — and how it picks each word next

Alignment, in one view:

The goal is usually summarised as HHH (Helpful, Harmless, Honest), including learning to refuse genuinely harmful requests. The signal comes from humans comparing answers. RLHF turns those comparisons into a reward model’s score and optimises the LLM against it with PPO; DPO, more simply, optimises the LLM on the comparisons directly. And it is gloriously imperfect: reward hacking, sycophancy and over-refusal all trace back to this stage, and so does a model’s whole “personality.”

Step all the way back and the three-stage picture of a modern assistant is complete: pretraining gave it knowledge (Parts 9–11), SFT gave it instruction-following behaviour (Part 12), and RLHF/DPO gave it preference and polish (this part). Those three together are the assistant you use. But even a perfectly aligned model still faces a mechanical question every time it writes: from that probability distribution over the next token (Part 8), how does it actually pick the word? Picking the most likely every time makes it dull and repetitive; picking too randomly makes it incoherent. The knobs that control this — temperature, top-k and top-p sampling — plus the KV cache that makes generation fast, are Part 14.

Published	Jun 17, 2026
Updated	Jul 17, 2026