01 · RL ROLLOUT
Watching a model learn to play.
The scene above acts out GRPO (Group Relative Policy Optimization), the reinforcement-learning recipe behind today’s reasoning models, with chess as the task. Training runs endgame-first: games start a few plies from checkmate, where reward is close and credit assignment is easy, and the starting position slides back toward the opening as the policy improves. This is a reverse curriculum, the same trick that makes a hard gym bootstrappable. Each move plays one training step:
01
Sample
The policy π_θ samples a group of candidate moves for the position — a handful of rollouts, drawn from what it currently believes is good.
02
Reward
A verifier scores each rollout — and the game itself supplies the ground truth: checkmate pays +1. No human labels, just an outcome you can trust.
03
Advantage
Each rollout's advantage is its reward minus the group's mean (GRPO as originally published also divides by the group's standard deviation). Better-than-average moves get reinforced; worse-than-average moves get suppressed.
04
Update
Credit flows back into the policy and its network re-wires — nodes jolt, weights shift, big early and finer as it converges. Then it plays the best move. Repeat, and it learns: early steps are near coin-tosses; watch the percentages sharpen as the counter climbs.
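The four steps above can be sketched on a toy problem. This is a minimal illustration, not the code behind the scene: the policy is a bare softmax over a handful of candidate moves rather than a network, and the names `grpo_step`, `group_advantages`, `group_size`, and `lr` are assumptions made for this example. The update uses the mean-centred advantage described above; set `normalize=True` to also divide by the group's standard deviation.

```python
import math
import random


def softmax(logits):
    """Turn logits into move probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def group_advantages(rewards, normalize=True, eps=1e-8):
    """Advantage = reward minus group mean, optionally divided by group std."""
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    if not normalize:
        return centered
    std = math.sqrt(sum(c * c for c in centered) / len(rewards))
    return [c / (std + eps) for c in centered]


def grpo_step(logits, reward_fn, group_size=8, lr=0.5, rng=random):
    """One toy step: sample a group, score it, centre the rewards, update."""
    probs = softmax(logits)
    # 01 Sample: draw a group of candidate moves from the current policy.
    moves = rng.choices(range(len(logits)), weights=probs, k=group_size)
    # 02 Reward: a verifier scores each rollout.
    rewards = [reward_fn(m) for m in moves]
    # 03 Advantage: compare each rollout with the group average.
    advantages = group_advantages(rewards, normalize=False)
    # 04 Update: policy-gradient step on the logits.
    new_logits = list(logits)
    for move, adv in zip(moves, advantages):
        for j in range(len(logits)):
            # d log pi(move) / d logit_j = 1[j == move] - pi(j)
            grad = (1.0 if j == move else 0.0) - probs[j]
            new_logits[j] += lr * adv * grad / group_size
    return new_logits


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.0, 0.0, 0.0, 0.0]  # four candidate moves, uniform at first
    mate_move = 2                  # hypothetical: move 2 delivers checkmate
    for _ in range(200):
        logits = grpo_step(logits, lambda m: 1.0 if m == mate_move else 0.0, rng=rng)
    print([round(p, 3) for p in softmax(logits)])
```

When every rollout in a group earns the same reward, all advantages are zero and the step changes nothing, which is one reason starting close to checkmate helps: mixed outcomes within a group are what carry the learning signal.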
READING THE LIGHT
02 · AGENT GYM
A town no single agent can run alone.
The scene above is a miniature of an agent gym, in the spirit of Stanford’s Generative Agents (“Smallville”) — but shaped to train and test agentic models, not just simulate life. Several LLM agents must cooperate over a long horizon, under disturbances, to keep the village fed. It’s the kind of multi-step, multi-agent task a single chat turn can’t measure:
01
A shared goal
One objective binds the town: keep everyone fed. Bread is the deliverable, and the children — who play when fed and slump when hungry — are the reward made human.
02
Divide & relay
No agent can do it alone. A farmer grows wheat, a miller grinds flour, a baker bakes — each depends on the others' output. Coordination, not a single smart move, is what wins.
03
Hedge or invest
Surplus is a choice: bank it in the granary as a reserve (insurance), or sell it for coins (growth). Bank too little and a shock finds you exposed; bank too much and growth stalls. It is a real long-horizon risk-reward trade-off.
04
Disturbances
Then the wind drops and the mill stalls, or rabbits ruin the harvest. The granary cushions the blow — but if a bad stretch drains the reserve, the village goes hungry and has to recover.
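The dynamics described above can be sketched as a tiny simulation. It is a rough illustration under stated assumptions: the agents here are fixed rules rather than LLMs, and every name and number (`simulate`, `reserve_fraction`, `mouths`, `harvest`, `shock_prob`) is hypothetical, chosen only to make the hedge-or-invest trade-off visible.

```python
import random
from dataclasses import dataclass


@dataclass
class Village:
    granary: int = 0      # banked bread: the reserve (insurance)
    coins: int = 0        # proceeds from sold surplus (growth)
    hungry_days: int = 0  # days the village could not be fed


def farmer(harvest, rabbits):
    """Grow wheat, unless rabbits ruin the harvest."""
    return 0 if rabbits else harvest


def miller(wheat, wind):
    """Grind wheat into flour, unless the wind drops and the mill stalls."""
    return wheat if wind else 0


def baker(flour):
    """Bake flour into bread, one loaf per unit (a simplification)."""
    return flour


def simulate(reserve_fraction, days=200, mouths=8, harvest=10,
             shock_prob=0.1, seed=0):
    """Run the relay for `days` and return the final village state."""
    rng = random.Random(seed)
    village = Village()
    for _ in range(days):
        # Disturbances: both draws happen every day, so the shock sequence
        # depends only on the seed, not on the policy being tested.
        rabbits = rng.random() < shock_prob
        wind = rng.random() >= shock_prob
        # Divide and relay: each role depends on the previous one's output.
        bread = baker(miller(farmer(harvest, rabbits), wind))
        if bread >= mouths:
            # Hedge or invest: split the surplus between granary and market.
            surplus = bread - mouths
            banked = int(surplus * reserve_fraction)
            village.granary += banked
            village.coins += surplus - banked
        else:
            # A shock: draw down the reserve, go hungry if it runs out.
            shortfall = mouths - bread
            drawn = min(village.granary, shortfall)
            village.granary -= drawn
            if drawn < shortfall:
                village.hungry_days += 1
    return village


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, simulate(fraction))
```

Because the shock sequence is fixed by the seed, runs with different `reserve_fraction` values face the same disturbances, so any difference in hungry days or coins comes from the policy alone.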
READING THE VILLAGE
03 · DIFFUSION MRI RECONSTRUCTION
Reconstruction by reverse diffusion.
The problem I worked on at Q.bio: scan far faster by measuring only a fraction of k-space, then reconstruct the image with a diffusion model. Reverse diffusion starts from noise and is steered at every step by data consistency with the measurements, so the result reflects the real patient rather than a plausible-looking fake:
01
Undersample
Acquire only a fraction of 3D k-space (the cube), sampled densely at the low-frequency centre and sparsely outside. Because the sampling falls below the Nyquist rate, a plain inverse FFT would only produce an aliased image.
02
Multimodal prior
The score model is a multimodal LLM-diffusion model: a learned prior over anatomy, conditioned on language (the amber token stream) alongside the scan, and a much richer prior than hand-crafted sparsity.
03
Reverse diffusion
Start from pure noise and denoise step by step (t: T→0), letting the score model pull the estimate toward the image manifold.
04
Data consistency
At every step, project back onto the measured k-space (the loop). That anchors the sample to this patient — a reconstruction, not a hallucination — fast and faithful.
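The loop above can be sketched in NumPy. This is an illustrative sketch, not the Q.bio pipeline: it works in 2D rather than 3D, the row-based sampling mask is an assumed scheme, and `placeholder_denoiser` is a stand-in that merely shrinks the estimate toward zero where a trained score model would go. All function and parameter names are hypothetical.

```python
import numpy as np


def to_kspace(image):
    """Image -> centred k-space (low frequencies in the middle)."""
    return np.fft.fftshift(np.fft.fft2(image, norm="ortho"))


def to_image(kspace):
    """Centred k-space -> image."""
    return np.fft.ifft2(np.fft.ifftshift(kspace), norm="ortho")


def undersampling_mask(shape, center_fraction=0.1, outer_prob=0.2, seed=0):
    """Keep a dense band of central rows plus random rows outside it."""
    rng = np.random.default_rng(seed)
    rows = shape[0]
    keep = rng.random(rows) < outer_prob           # sparse outer rows
    half = max(1, int(rows * center_fraction / 2))
    mid = rows // 2
    keep[mid - half: mid + half] = True            # dense low-frequency centre
    mask = np.zeros(shape, dtype=bool)
    mask[keep, :] = True
    return mask


def data_consistency(estimate, measured_k, mask):
    """Overwrite the estimate's k-space with the measurements where sampled."""
    k = to_kspace(estimate)
    k[mask] = measured_k[mask]
    return to_image(k)


def placeholder_denoiser(x, t):
    """Stand-in for the learned score model: only damps the estimate."""
    return 0.9 * x


def reconstruct(measured_k, mask, steps=50, denoise=placeholder_denoiser, seed=0):
    """Start from noise, then alternate denoising with data consistency."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(mask.shape) + 1j * rng.standard_normal(mask.shape)
    for t in range(steps, 0, -1):                  # t: T -> 1
        x = denoise(x, t)
        x = data_consistency(x, measured_k, mask)
    return x


if __name__ == "__main__":
    image = np.zeros((32, 32))
    image[8:24, 10:20] = 1.0                       # toy "anatomy"
    mask = undersampling_mask(image.shape)
    measured = to_kspace(image) * mask
    recon = reconstruct(measured, mask)
    print("sampled fraction:", mask.mean())
    print("max error on measured k-space:",
          np.abs(to_kspace(recon)[mask] - measured[mask]).max())
```

With this placeholder the unmeasured frequencies simply decay, so the output approaches the zero-filled (aliased) reconstruction. The sketch only shows the mechanics of the projection; filling in the missing frequencies plausibly is the job of the learned prior.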
READING THE SCAN