Lab — Yuhua (Bill) Chen

01 · RL ROLLOUT

Watching a model learn to play.

The scene above acts out GRPO (Group Relative Policy Optimization), the reinforcement-learning recipe behind today’s reasoning models, with chess as the task. Training runs endgame-first: games start a few plies from checkmate, where reward is close and credit assignment is easy, and the starting position slides back toward the opening as the policy improves. This is a reverse curriculum, the same trick that makes a hard gym bootstrappable. Each move plays out one training step:

Sample

The policy π_θ samples a group of candidate moves for the position — a handful of rollouts, drawn from what it currently believes is good.

Reward

A verifier scores each rollout — and the game itself supplies the ground truth: checkmate pays +1. No human labels, just an outcome you can trust.

Advantage

Each rollout's advantage is its reward minus the group's mean (GRPO as originally formulated also divides by the group's standard deviation). Better-than-average moves get reinforced; worse-than-average moves get suppressed.

Update

Credit flows back into the policy and its network re-wires: nodes jolt and weights shift, with large updates early and finer ones as it converges. Then it plays the best move. Repeat, and it learns: early steps are near coin-tosses; watch the percentages sharpen as the counter climbs.
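
The four steps above can be sketched in a few lines. This is a minimal illustration, not the code behind the scene: it assumes a toy softmax policy over a fixed list of candidate moves, a hypothetical reward_fn standing in for the verifier, and a plain policy-gradient update. Full GRPO, as usually described, also normalizes by the group's standard deviation and adds a clipped probability ratio and a KL penalty, all omitted here.

```python
import numpy as np


def softmax(logits):
    """Turn move logits into sampling probabilities."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def grpo_step(logits, reward_fn, group_size=8, lr=0.5, rng=None):
    """One simplified group-relative update on a softmax policy.

    `reward_fn`, `group_size` and `lr` are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = softmax(logits)

    # Sample: draw a group of candidate moves from the current policy.
    moves = rng.choice(len(logits), size=group_size, p=probs)

    # Reward: the verifier scores each rollout.
    rewards = np.array([reward_fn(int(m)) for m in moves], dtype=float)

    # Advantage: reward minus the group's mean.
    advantages = rewards - rewards.mean()

    # Update: for a softmax policy, d log pi(move) / d logits = onehot - probs.
    grad = np.zeros_like(logits)
    for move, adv in zip(moves, advantages):
        onehot = np.zeros_like(logits)
        onehot[move] = 1.0
        grad += adv * (onehot - probs)
    new_logits = logits + lr * grad / group_size
    return new_logits, moves, advantages


# Usage: four candidate moves, only move 2 delivers checkmate (+1).
logits = np.zeros(4)
rng = np.random.default_rng(0)
for _ in range(300):
    logits, _, _ = grpo_step(logits, lambda m: 1.0 if m == 2 else 0.0, rng=rng)
print(softmax(logits).round(3))
```

Because the advantages in a group sum to zero, a group in which every move earns the same reward produces no update at all, which is one reason starting near checkmate helps: some rollouts actually win.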

READING THE LIGHT

glow + %: how strongly the policy favors that candidate (its sampling probability)

gold, solid: the winner being reinforced, above the group's mean

rising embers: suppressed candidates, below the mean and fading out

gold spark: the update, with credit flying into the policy, whose node-and-edge network then re-wires

checkered towers: each side's trajectory, with past positions receding into history

checkmate · R = +1: the outcome reward that closes every game
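
The endgame-first schedule described at the top of this section can be sketched as a small scheduling rule. This is a hedged illustration only: the function name, the win-rate threshold, and the step size are all hypothetical, and the scene may use a different rule.

```python
def next_start_distance(distance, win_rate, max_distance, promote_at=0.8, step=2):
    """Return how many plies before checkmate the next games should start.

    All parameters are illustrative assumptions: once the policy wins often
    enough from the current distance, the start slides back toward the opening.
    """
    if win_rate >= promote_at:
        return min(distance + step, max_distance)
    return distance


# Usage: start 3 plies from mate and move back only when the policy is ready.
distance = 3
for win_rate in [0.4, 0.85, 0.9, 0.6]:
    distance = next_start_distance(distance, win_rate, max_distance=80)
```

Gating on win rate keeps the reward signal dense: the start only moves further from mate once the policy reliably finishes games from where it already stands.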

02 · AGENT GYM

A town no single agent can run alone.

The scene above is a miniature of an agent gym, in the spirit of Stanford’s Generative Agents (“Smallville”) — but shaped to train and test agentic models, not just simulate life. Several LLM agents must cooperate over a long horizon, under disturbances, to keep the village fed. It’s the kind of multi-step, multi-agent task a single chat turn can’t measure:

A shared goal

One objective binds the town: keep everyone fed. Bread is the deliverable, and the children — who play when fed and slump when hungry — are the reward made human.

Divide & relay

No agent can do it alone. A farmer grows wheat, a miller grinds flour, a baker bakes — each depends on the others' output. Coordination, not a single smart move, is what wins.

Hedge or invest

Surplus is a choice: bank it in the granary as a reserve (insurance), or sell it for coins (growth). Hoard too little and a shock finds you exposed; hoard too much and you give up growth. It is a real long-horizon, risk-reward trade-off.

Disturbances

Then the wind drops and the mill stalls, or rabbits ruin the harvest. The granary cushions the blow — but if a bad stretch drains the reserve, the village goes hungry and has to recover.
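
The mechanics above can be sketched as a toy environment. This is a hedged miniature, not the gym's actual code: the class, the quantities (6 wheat per harvest, 4 villagers), and the shock names are all assumptions, and the LLM agents are reduced to a single bank_fraction decision.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Village:
    """Hypothetical toy model of the wheat -> flour -> bread chain."""
    wheat: int = 0
    flour: int = 0
    bread: int = 0
    granary: int = 0
    coins: int = 0
    villagers: int = 4
    rng: random.Random = field(default_factory=random.Random)

    def step(self, bank_fraction=0.5, shock_prob=0.2):
        """Advance one day; return (reward, shock)."""
        # Disturbance: sometimes the wind drops or rabbits ruin the harvest.
        shock = None
        if self.rng.random() < shock_prob:
            shock = self.rng.choice(["no_wind", "rabbits"])

        # Farmer grows wheat (nothing if rabbits got there first).
        self.wheat += 0 if shock == "rabbits" else 6

        # Miller grinds wheat into flour (the mill stalls without wind).
        if shock != "no_wind":
            self.flour += self.wheat
            self.wheat = 0

        # Baker turns flour into bread.
        self.bread += self.flour
        self.flour = 0

        # Feed the village, drawing on the granary to cover any shortfall.
        eaten = min(self.bread, self.villagers)
        self.bread -= eaten
        from_reserve = min(self.granary, self.villagers - eaten)
        self.granary -= from_reserve
        eaten += from_reserve

        # Hedge or invest: bank part of the surplus, sell the rest.
        banked = int(self.bread * bank_fraction)
        self.granary += banked
        self.coins += self.bread - banked
        self.bread = 0

        # Reward: the fraction of the village that was fed today.
        return eaten / self.villagers, shock


# Usage: run one short episode with a fixed seed.
village = Village(rng=random.Random(0))
history = [village.step()[0] for _ in range(20)]
```

Even in this reduced form, the trade-off shows up: a bank_fraction of 0 maximizes coins but leaves the reward exposed to the first shock.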

READING THE VILLAGE

the agents: each peg figure is an LLM agent, marked by the glowing ring

thought bubble: tokens light up as an agent reasons out its next step

the chain: wheat → flour → bread, relayed between them

green gauge: the reward, showing how well-fed the village is

sparkline: the reward's trailing history over the episode

episode counter: each run is one episode; it ends at the time limit or when the village starves, and the gym resets

orange callout: names the disturbance currently hitting the chain

granary: the reserve that buffers shocks; coins = surplus sold

mill sails: turn with the wind; stall in a disturbance

03 · DIFFUSION MRI RECONSTRUCTION

Reconstruction by reverse diffusion.

The problem I worked on at Q.bio: scan far faster by measuring only a fraction of k-space, then reconstruct the image with a diffusion model. Reverse diffusion starts from noise and is steered at every step by data consistency with the measurements, so the result reflects the real patient, not a plausible-looking fake:

Undersample

Acquire only a fraction of 3D k-space (the cube), dense at the low-frequency centre and sparse outside. Because this samples below the Nyquist rate, a plain inverse FFT would just produce aliasing artifacts.

Multimodal prior

The score model is a multimodal LLM-diffusion model: a learned prior over anatomy, conditioned on language (the amber token stream) alongside the scan, and far more expressive than hand-crafted sparsity priors.

Reverse diffusion

Start from pure noise and denoise step by step (t: T→0), letting the score model pull the estimate toward the image manifold.

Data consistency

At every step, project back onto the measured k-space (the loop). That anchors the sample to this patient — a reconstruction, not a hallucination — fast and faithful.
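
The loop above can be sketched with NumPy. This is a simplified illustration under stated assumptions, not the method used at Q.bio: make_mask, data_consistency and reconstruct are hypothetical names, the denoise argument is a stand-in for the learned, language-conditioned score model, and the sampler skips the per-step noise schedule a real diffusion sampler would use.

```python
import numpy as np


def make_mask(shape, center_radius=0.15, outer_prob=0.2, rng=None):
    """Undersampling mask: dense low-frequency centre, sparse random outside."""
    rng = np.random.default_rng() if rng is None else rng
    # Normalised frequencies in [-0.5, 0.5) per axis, in unshifted FFT layout.
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    mask = rng.random(shape) < outer_prob
    mask |= radius <= center_radius
    return mask


def data_consistency(x, measured_k, mask):
    """Overwrite the estimate's k-space with the measured samples."""
    k = np.fft.fftn(x)
    k[mask] = measured_k[mask]
    return np.fft.ifftn(k)


def reconstruct(measured_k, mask, denoise, num_steps=50, rng=None):
    """Start from noise; alternate a denoising step with data consistency."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(mask.shape) + 0j  # pure noise at t = T
    for step in range(num_steps, 0, -1):
        t = step / num_steps            # diffusion time, running from 1 toward 0
        x = denoise(x, t)               # prior: pull toward plausible anatomy
        x = data_consistency(x, measured_k, mask)  # anchor to this patient
    return x


# Usage with a synthetic volume and a placeholder denoiser (simple shrinkage).
rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 8, 8))
mask = make_mask(volume.shape, rng=rng)
measured_k = np.fft.fftn(volume) * mask
recon = reconstruct(measured_k, mask, lambda x, t: 0.9 * x, num_steps=10, rng=rng)
```

Whatever the denoiser does, the final projection guarantees that the reconstruction agrees with the scanner at every measured k-space location; the prior only fills in what was never acquired.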

READING THE SCAN

k-space cube: the 3D undersampled measurement (dense bright centre)

neural glyph: the score / denoiser prior

amber tokens: multimodal conditioning (language / report)

looping particles: data consistency, volume ⇄ k-space each step

brain volume: the 3D image being denoised, from noise to reconstruction

side bar: diffusion time t, full at T (pure noise) and drained at 0 (reconstructed)