LLM inference · interactive walkthrough · figures verified

Attention got expensive. Then it got cool.

Generating one token used to mean re-deriving the keys and values for the entire past from scratch: pure wasted heat. The KV cache stopped the recomputation; PagedAttention stopped wasting the memory the cache lives in, cutting KV-cache memory waste from 60–80% to under 4%. Step through all three stages, one operation at a time.

Demo: autoregressive decode. Watch each new token's cost; without a cache, every step recomputes all previous keys and values.
The five ideas

From a matrix multiply to a memory manager

Each idea fixes the bottleneck the previous one exposed. Read top to bottom — the story is cumulative, and so are the wins.

Interactive · step-by-step

Three instruments. One thermal language.

Hot cells = compute being spent. Cool cells = work reused from cache. Violet = paging indirection. Use Next / Prev, the dots, or autoplay.

01 · the baseline

Standard self-attention, one forward pass

Project to Q/K/V, score, scale, softmax, weight, and sum — for one focus token in a 4-token sequence.
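The steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the demo's own code: the sizes (`D_MODEL`, `D_K`), the random embeddings and weights, and the helper names `rand_matrix`, `project`, and `attend` are all assumptions made for the example. The focus token is taken to be the last of the four, so no causal mask is needed.

```python
import math
import random

random.seed(0)
T, D_MODEL, D_K = 4, 8, 4  # 4 tokens; the other sizes are illustrative assumptions


def rand_matrix(rows, cols):
    """Random stand-in for embeddings or learned weights."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def project(x, W):
    """Multiply row vector x by matrix W (len(x) must equal the row count of W)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def attend(X, W_q, W_k, W_v, focus):
    """Attention weights and output for one focus token over every token in X."""
    # Project: one query for the focus token, keys and values for all tokens.
    q = project(X[focus], W_q)
    K = [project(x, W_k) for x in X]
    V = [project(x, W_v) for x in X]
    d_k = len(q)
    # Score: dot product of the query with each key.
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    # Scale: divide by sqrt(d_k) to keep the softmax well-conditioned.
    scaled = [s / math.sqrt(d_k) for s in scores]
    # Softmax: subtract the max first for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weight and sum: blend the value vectors using the attention weights.
    out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_k)]
    return weights, out


X = rand_matrix(T, D_MODEL)
W_q, W_k, W_v = (rand_matrix(D_MODEL, D_K) for _ in range(3))
weights, out = attend(X, W_q, W_k, W_v, focus=T - 1)
```

The `weights` list holds one non-negative number per token and sums to 1; `out` is the focus token's new representation, a weighted average of the four value vectors.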

02 · the speedup

Decoding: recompute-all vs. append-one

The same generation, two memory strategies, side by side. Watch the work counters diverge.
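To make the diverging counters concrete, here is a small sketch that counts work under both strategies. The unit of work is an assumption chosen for illustration: one unit is one token's key/value projection. The function names are hypothetical, and the cache holds placeholder entries rather than real tensors.

```python
def work_without_cache(n_tokens):
    """Units of K/V projection work when every step re-projects the whole prefix."""
    work = 0
    for step in range(1, n_tokens + 1):
        # Nothing is remembered, so all `step` tokens seen so far are projected again.
        work += step
    return work


def work_with_cache(n_tokens):
    """Units of K/V projection work when earlier keys and values are kept."""
    cache = []
    work = 0
    for step in range(1, n_tokens + 1):
        # Only the newest token is projected; its K/V entry is appended to the cache.
        cache.append(("k%d" % step, "v%d" % step))
        work += 1
    return work


no_cache = work_without_cache(8)
cached = work_with_cache(8)
```

For 8 generated tokens this gives 36 units without a cache and 8 with one: quadratic versus linear growth in projection work. The cache does not remove the attention scan itself, since each step still reads every cached key and value; what it removes is the recomputation.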

03 · the memory fix

PagedAttention — the KV cache as virtual memory

Fixed-size blocks, a block table, on-demand allocation, and copy-on-write sharing — memory that no longer needs to be contiguous.
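The four mechanisms named above can be sketched as a toy allocator. This is an illustrative model under stated assumptions, not the implementation of any particular inference engine: `BLOCK_SIZE`, `BlockPool`, and `Sequence` are hypothetical names, and strings stand in for key/value tensors.

```python
BLOCK_SIZE = 4  # tokens per block; an illustrative assumption


class BlockPool:
    """Physical KV memory, carved into fixed-size blocks with reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}
        self.data = {}  # physical block id -> list of per-token KV entries

    def alloc(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        self.data[block] = []
        return block

    def share(self, block):
        self.refcount[block] += 1

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block], self.data[block]
            self.free.append(block)


class Sequence:
    """One request's logical KV cache, mapped to physical blocks by a block table."""

    def __init__(self, pool, block_table=None, length=0):
        self.pool = pool
        self.block_table = block_table or []  # logical block index -> physical id
        self.length = length

    def append(self, kv):
        if self.length % BLOCK_SIZE == 0:
            # On-demand allocation: take a new block only when the last one is full.
            self.block_table.append(self.pool.alloc())
        else:
            last = self.block_table[-1]
            if self.pool.refcount[last] > 1:
                # Copy-on-write: the block is shared, so copy it before writing.
                fresh = self.pool.alloc()
                self.pool.data[fresh] = list(self.pool.data[last])
                self.pool.release(last)
                self.block_table[-1] = fresh
        self.pool.data[self.block_table[-1]].append(kv)
        self.length += 1

    def fork(self):
        """Share every block with a new sequence (e.g. a second sample of one prompt)."""
        for block in self.block_table:
            self.pool.share(block)
        return Sequence(self.pool, list(self.block_table), self.length)


pool = BlockPool(num_blocks=8)
a = Sequence(pool)
for t in range(6):
    a.append("kv%d" % t)
b = a.fork()
b.append("kvB")
```

After the fork and one append, `a` and `b` still point at the same physical block for their first four tokens, while the partially filled second block has been copied so that the write from `b` does not disturb `a`. Because the block table does the mapping, the physical blocks never need to be adjacent.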

Side by side

What each step actually buys you

Aspect | No cache | KV cache | PagedAttention
Verified figures

The payoff, in numbers

Every figure was cross-checked against a primary source during build. Framing matters — read what each one compares against before you quote it.

Reference

Glossary