
Watching a model learn to play.

The scene above acts out GRPO (Group Relative Policy Optimization), the reinforcement-learning recipe behind many of today’s reasoning models, with chess as the task. Training runs endgame-first: games start a few plies from checkmate, where the reward is close and credit assignment is easy, and the starting position slides back toward the opening as the policy improves. This is a reverse curriculum, the same trick that lets a policy bootstrap in an otherwise hard gym. Each move plays out one training step:

01

Sample

The policy π_θ samples a group of candidate moves for the position — a handful of rollouts, drawn from what it currently believes is good.

02

Reward

A verifier scores each rollout — and the game itself supplies the ground truth: checkmate pays +1. No human labels, just an outcome you can trust.

03

Advantage

Each rollout's advantage is its reward minus the group's mean (GRPO as published also divides by the group's standard deviation). Better-than-average moves are reinforced; worse-than-average ones are suppressed.

04

Update

Credit flows back into the policy and its network re-wires: nodes jolt and weights shift, with large changes early and finer ones as training converges. The policy then plays its best move. Repeat, and it learns: early steps are near coin-tosses, and the percentages sharpen as the counter climbs.
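The four steps above can be sketched on a toy problem. This is a minimal illustration, not the scene's actual code: the "policy" is just a vector of logits over four hypothetical moves, `grpo_step` and `reward_fn` are made-up names, and the clipped probability ratio and KL penalty used in full GRPO are left out. Advantages are mean-centred only, matching the description above.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grpo_step(logits, reward_fn, rng, group_size=8, lr=0.5):
    """One toy update: sample a group, score it, centre the rewards, nudge the logits."""
    probs = softmax(logits)
    # 01 Sample: draw a group of candidate moves from the current policy.
    moves = rng.choice(len(logits), size=group_size, p=probs)
    # 02 Reward: a verifier scores each rollout (here, a plain function).
    rewards = np.array([reward_fn(m) for m in moves], dtype=float)
    # 03 Advantage: reward minus the group's mean.
    # (Published GRPO also divides by the group's standard deviation.)
    advantages = rewards - rewards.mean()
    # 04 Update: policy-gradient step. For a softmax policy, the gradient of
    # log pi(m) with respect to the logits is onehot(m) - probs.
    grad = np.zeros_like(logits)
    for m, a in zip(moves, advantages):
        onehot = np.zeros_like(logits)
        onehot[m] = 1.0
        grad += a * (onehot - probs)
    return logits + lr * grad / group_size, moves, advantages

# Toy usage: move 2 is the only one the (hypothetical) verifier rewards.
rng = np.random.default_rng(0)
logits = np.zeros(4)                   # start as a uniform coin-toss policy
reward_fn = lambda m: 1.0 if m == 2 else 0.0
for _ in range(200):
    logits, moves, advantages = grpo_step(logits, reward_fn, rng)
print(softmax(logits).round(3))        # probability mass shifts toward move 2
```

When every rollout in a group earns the same reward, all advantages are zero and the step leaves the policy unchanged, which is why the updates shrink as the policy converges.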

glow + %: how strongly the policy favors that candidate (its sampling probability)
gold, solid: the winner being reinforced, above the group's mean
rising embers: suppressed candidates, below the mean, fading out
gold spark: the update, with credit flying into the policy, whose node-and-edge network then re-wires
checkered towers: each side's trajectory, with past positions receding into history
checkmate · R = +1: the outcome reward that closes every game
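The endgame-first schedule described at the top of this section could be driven by a rule like the one below. It is a sketch under assumptions: the function name, the win-rate thresholds, and the step size are all hypothetical, chosen only to show the idea of moving the start position further from checkmate as the policy improves.

```python
def next_start_distance(plies_from_mate, win_rate, max_plies,
                        promote_at=0.8, demote_at=0.3, step=2):
    """Return the new starting distance, in plies before checkmate.

    All thresholds are illustrative assumptions, not values from the scene.
    """
    if win_rate >= promote_at:
        # Policy is winning reliably: start games further from the mate.
        return min(plies_from_mate + step, max_plies)
    if win_rate <= demote_at:
        # Policy is struggling: move the start back toward the mate.
        return max(plies_from_mate - step, 1)
    return plies_from_mate

# Toy usage with a made-up sequence of measured win rates.
distance = 1
for win_rate in [0.9, 0.85, 0.5, 0.2, 0.95]:
    distance = next_start_distance(distance, win_rate, max_plies=80)
print(distance)  # 5
```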

A town no single agent can run alone.

The scene above is a miniature of an agent gym, in the spirit of Stanford’s Generative Agents (“Smallville”) — but shaped to train and test agentic models, not just simulate life. Several LLM agents must cooperate over a long horizon, under disturbances, to keep the village fed. It’s the kind of multi-step, multi-agent task a single chat turn can’t measure:

01

A shared goal

One objective binds the town: keep everyone fed. Bread is the deliverable, and the children — who play when fed and slump when hungry — are the reward made human.

02

Divide & relay

No agent can do it alone. A farmer grows wheat, a miller grinds flour, a baker bakes — each depends on the others' output. Coordination, not a single smart move, is what wins.

03

Hedge or invest

Surplus is a choice: bank it in the granary as a reserve (insurance), or sell it for coins (growth). Hoard too little and a shock finds you exposed; hoard too much and you give up growth. It is a real long-horizon risk-reward trade-off.

04

Disturbances

Then the wind drops and the mill stalls, or rabbits ruin the harvest. The granary cushions the blow — but if a bad stretch drains the reserve, the village goes hungry and has to recover.
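The hedge-or-invest trade-off under disturbances can be made concrete with a toy simulation. This is only a sketch: the agents here follow a fixed scripted rule rather than an LLM, and every number (harvest size, demand, shock probability, granary capacity) and the name `run_episode` are assumptions made for illustration.

```python
import random

def run_episode(reserve_share, days=200, seed=0,
                harvest=12, demand=10, shock_prob=0.15, granary_cap=40):
    """Simulate one episode; reserve_share is the fraction of surplus banked."""
    rng = random.Random(seed)
    granary, coins, hungry_days = 0.0, 0.0, 0
    for _ in range(days):
        # Disturbance: a stalled mill or ruined harvest yields no bread today.
        bread = 0 if rng.random() < shock_prob else harvest
        if bread >= demand:
            surplus = bread - demand
            banked = min(surplus * reserve_share, granary_cap - granary)
            granary += banked          # insurance
            coins += surplus - banked  # growth
        else:
            shortfall = demand - bread
            drawn = min(shortfall, granary)
            granary -= drawn
            if drawn < shortfall:      # reserve could not cover the gap
                hungry_days += 1
    return {"fed_rate": 1 - hungry_days / days, "coins": coins}

# Same seed means the same shocks, so only the policy differs.
sell_all = run_episode(reserve_share=0.0)
bank_all = run_episode(reserve_share=1.0)
print(sell_all, bank_all)
```

Selling everything maximises coins but leaves every shock uncovered; banking everything keeps the village fed more often and earns coins only once the granary is full.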

the agents: each peg figure is an LLM agent, marked by the glowing ring
thought bubble: tokens light up as an agent reasons out its next step
the chain: wheat → flour → bread, relayed between them
green gauge: the reward, i.e. how well-fed the village is
sparkline: the reward's trailing history over the episode
episode counter: each run is one episode; hitting the time limit or starving out ends it, and the gym resets
orange callout: names the disturbance currently hitting the chain
granary: the reserve that buffers shocks; coins = surplus sold
mill sails: turn with the wind; stall in a disturbance

Reconstruction by reverse diffusion.

The problem I worked on at Q.bio: scan far faster by measuring a fraction of k-space, then reconstruct with a diffusion model — reverse diffusion from noise, steered at every step by data consistency with the measurements, so the result is the real patient, not a plausible-looking fake:

01

Undersample

Acquire only a fraction of 3D k-space (the cube), dense at the low-frequency centre and sparse outside. Because this samples below the Nyquist rate, a plain inverse FFT would produce aliasing artifacts.

02

Multimodal prior

The score model is a multimodal LLM-diffusion model: a learned prior over anatomy, conditioned on language (the amber token stream) alongside the scan, going far beyond hand-crafted sparsity priors.

03

Reverse diffusion

Start from pure noise and denoise step by step (t: T→0), letting the score model pull the estimate toward the image manifold.

04

Data consistency

At every step, project back onto the measured k-space (the loop). That anchors the sample to this patient — a reconstruction, not a hallucination — fast and faithful.
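The loop described in the four steps above can be sketched with NumPy. This is an illustration under assumptions, not the method used at Q.bio: `make_mask`, `data_consistency`, and `reconstruct` are hypothetical names, the acquisition is a noiseless single-coil Cartesian toy, the sampler omits the noise re-injection a real reverse-diffusion schedule would use, and `toy_denoise` is a placeholder shrinkage standing in for the learned score model.

```python
import numpy as np

def make_mask(shape, rng, centre_frac=0.15, outer_prob=0.2):
    """Sampling mask: fully sampled low-frequency centre, sparse random elsewhere."""
    mask = rng.random(shape) < outer_prob
    # fftfreq gives each axis's frequencies in the unshifted FFT layout,
    # so the "centre" of k-space sits at index 0 along every axis.
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    mask |= radius <= centre_frac * 0.5   # fftfreq values lie in [-0.5, 0.5)
    return mask

def data_consistency(x, measured_k, mask):
    """Overwrite the estimate's k-space with the measured values where sampled."""
    k = np.fft.fftn(x)
    k = np.where(mask, measured_k, k)
    return np.fft.ifftn(k)

def reconstruct(measured_k, mask, denoise, steps, rng):
    """Start from noise; alternate a prior step with a data-consistency step."""
    x = rng.standard_normal(mask.shape) + 1j * rng.standard_normal(mask.shape)
    for t in range(steps, 0, -1):          # t: T -> 1
        x = denoise(x, t)                  # prior step (placeholder here)
        x = data_consistency(x, measured_k, mask)
    return x

# Toy usage on a synthetic 3D volume (a bright cube).
rng = np.random.default_rng(0)
truth = np.zeros((16, 16, 16))
truth[4:12, 4:12, 4:12] = 1.0
mask = make_mask(truth.shape, rng)
measured_k = np.fft.fftn(truth) * mask     # undersampled measurement
toy_denoise = lambda x, t: 0.9 * x         # NOT a score model, just a stand-in
recon = reconstruct(measured_k, mask, toy_denoise, steps=10, rng=rng)
```

Because the data-consistency step runs last, the output agrees with the measurements at every sampled k-space location; the prior only decides what fills the unmeasured ones.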

k-space cube: the 3D undersampled measurement (dense bright centre)
neural glyph: the score / denoiser prior
amber tokens: multimodal conditioning (language / report)
looping particles: data consistency, volume ⇄ k-space each step
brain volume: the 3D image denoising from noise → reconstruction
side bar: diffusion time t, full at T (pure noise), drained at 0 (reconstructed)