LLM inference · interactive walkthrough · figures verified

Attention got expensive. Then it got cool.

Generating one token used to mean re-deriving the entire past from scratch: pure wasted heat. The KV cache stopped the recomputation; PagedAttention stopped wasting the memory the cache lives in, cutting KV-cache memory waste from 60–80% to under 4%. Step through all three demos, one operation at a time.


[Interactive demo: autoregressive decode. Watch each new token's cost on a running work counter. Without a cache, every step recomputes all previous keys and values.]

The five ideas

From a matrix multiply to a memory manager

Each idea fixes the bottleneck the previous one exposed. Read top to bottom — the story is cumulative, and so are the wins.

Interactive · step-by-step

Three instruments. One thermal language.

Hot cells = compute being spent. Cool cells = work reused from cache. Violet = paging indirection. Use Next / Prev, the dots, or autoplay.

01 · the baseline

Standard self-attention, one forward pass

Project to Q/K/V, score, scale, softmax, weight, and sum — for one focus token in a 4-token sequence.
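The six operations listed above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the demo's actual code: the function name `attend_one_token`, the random weights, the model width of 8, and the causal restriction (the focus token only sees itself and earlier tokens) are all assumptions made for the example.

```python
import numpy as np

def attend_one_token(x, w_q, w_k, w_v, focus):
    """Single-head attention output for one focus position (hypothetical helper).

    Assumes causal attention: the focus token attends to positions 0..focus.
    """
    q = x[focus] @ w_q                        # project: query for the focus token
    k = x[: focus + 1] @ w_k                  # project: keys for visible tokens
    v = x[: focus + 1] @ w_v                  # project: values for visible tokens
    scores = k @ q                            # score: one dot product per visible token
    scaled = scores / np.sqrt(q.shape[-1])    # scale by sqrt(d_k)
    weights = np.exp(scaled - scaled.max())   # softmax, shifted for numerical stability
    weights /= weights.sum()
    return weights @ v, weights               # weight the values and sum them

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # 4-token sequence, assumed width 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = attend_one_token(x, w_q, w_k, w_v, focus=3)
```

With `focus=3` the last token sees all four positions, so `weights` has four entries that sum to 1 and `out` is a single vector of width 8.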

02 · the speedup

Decoding: recompute-all vs. append-one

The same generation, two memory strategies, side by side. Watch the work counters diverge.
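To see why the counters diverge, it helps to count the work directly. The sketch below assumes one "unit" is one key/value projection of one token, which may not match the unit the demo uses; the function name `kv_projection_work` is hypothetical.

```python
def kv_projection_work(prompt_len, new_tokens, use_cache):
    """Count K/V projections needed to generate `new_tokens` tokens.

    One unit = projecting one token to its key and value (an assumed unit).
    """
    work = 0
    cached = 0  # number of tokens whose K/V are already stored
    for step in range(new_tokens):
        context_len = prompt_len + step       # tokens the new token attends to
        if use_cache:
            work += context_len - cached      # only project tokens not yet cached
            cached = context_len
        else:
            work += context_len               # re-project the whole context
    return work

recompute_all = kv_projection_work(prompt_len=4, new_tokens=4, use_cache=False)
append_one = kv_projection_work(prompt_len=4, new_tokens=4, use_cache=True)
print(recompute_all, append_one)  # prints: 22 7
```

Without a cache the total grows quadratically with the number of generated tokens (4 + 5 + 6 + 7 here); with a cache it is the prompt once, then one unit per new token.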

03 · the memory fix

PagedAttention — the KV cache as virtual memory

Fixed-size blocks, a block table, on-demand allocation, and copy-on-write sharing — memory that no longer needs to be contiguous.
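The four mechanisms named above can be sketched as a small bookkeeping class. This is an illustration under stated assumptions, not vLLM's implementation: the class `PagedKVCache`, its method names, and the block counts are hypothetical, and the sketch tracks only block ids and reference counts, not the key/value tensors themselves.

```python
class PagedKVCache:
    """Toy block manager: tracks which physical blocks each sequence uses."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical block ids
        self.refcount = [0] * num_blocks      # sequences sharing each block
        self.tables = {}                      # seq id -> block table (list of block ids)
        self.lengths = {}                     # seq id -> tokens stored

    def _alloc(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def add_sequence(self, seq_id):
        self.tables[seq_id] = []
        self.lengths[seq_id] = 0

    def append_token(self, seq_id):
        table = self.tables[seq_id]
        n = self.lengths[seq_id]
        if n % self.block_size == 0:
            table.append(self._alloc())       # on-demand: new block only when needed
        elif self.refcount[table[-1]] > 1:
            # Copy-on-write: the last block is shared, so take a private copy.
            self.refcount[table[-1]] -= 1
            table[-1] = self._alloc()         # a real system would copy the K/V data here
        self.lengths[seq_id] = n + 1

    def fork(self, parent_id, child_id):
        """Share every block of the parent with a new child sequence."""
        self.tables[child_id] = list(self.tables[parent_id])
        self.lengths[child_id] = self.lengths[parent_id]
        for block in self.tables[child_id]:
            self.refcount[block] += 1

cache = PagedKVCache(num_blocks=8, block_size=4)
cache.add_sequence("a")
for _ in range(6):
    cache.append_token("a")                   # 6 tokens -> 2 blocks, second half full
cache.fork("a", "b")                          # "b" shares both blocks with "a"
cache.append_token("b")                       # triggers copy-on-write on the last block
```

After the final append, the two block tables still share their first block but point at different last blocks, and nothing in either table needs to be contiguous in physical memory.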

Verified figures

The payoff, in numbers

Every figure was cross-checked against a primary source during build. Framing matters — read what each one compares against before you quote it.


From a matrix multiply to a memory manager

Three instruments. One thermal language.

Standard self-attention, one forward pass

Decoding: recompute-all vs. append-one

PagedAttention — the KV cache as virtual memory

What each step actually buys you

The payoff, in numbers

Glossary
