Generating one token used to mean re-deriving the entire past from scratch — pure wasted heat. The KV cache stopped the recomputation; PagedAttention stopped wasting the memory the cache lives in, cutting waste from 60–80% to under 4%. Step through all three, one operation at a time.
Each idea fixes the bottleneck the previous one exposed. Read top to bottom — the story is cumulative, and so are the wins.
Hot cells = compute being spent. Cool cells = work reused from cache. Violet = paging indirection. Use Next / Prev, the dots, or autoplay.
Project to Q/K/V, score, scale, softmax, weight, and sum — for one focus token in a 4-token sequence.
The same generation, two memory strategies, side by side. Watch the work counters diverge.
Fixed-size blocks, a block table, on-demand allocation, and copy-on-write sharing — memory that no longer needs to be contiguous.
| aspect | no cache | kv cache | paged attention |
|---|
Every figure was cross-checked against a primary source during build. Framing matters — read what each one compares against before you quote it.