1Cademy - Key-Value (KV) Cache in Transformer Inference

Learn Before

Computational Cost of Self-Attention in Transformers
Global Nature of Standard Transformer LLMs
Auto-Regressive Generation Process
Reusability of Key-Value Pairs in Autoregressive Inference

Concept

Key-Value (KV) Cache in Transformer Inference

The Key-Value (KV) cache is a key optimization for efficient autoregressive inference in Transformer models. It stores the key and value vectors of all previously processed tokens. At each generation step, instead of recomputing these vectors for the entire preceding sequence, the model computes the query, key, and value only for the current token, appends the new key and value to the cache, and has the current token's query attend to the keys and values stored there. Because the keys and values of past tokens do not change under causal attention, reusing them removes redundant computation and significantly improves inference speed, at the cost of memory that grows with sequence length. For this reason, KV caching is standard practice in Transformer inference.
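The mechanism described above can be illustrated with a minimal sketch of single-head attention in plain Python. Everything named here (`KVCache`, `step_with_cache`, `step_without_cache`, and the projection matrices `Wq`, `Wk`, `Wv`) is hypothetical and chosen only for illustration. The sketch omits layers, multiple heads, batching, and positional encodings, so it should be read as a toy under those assumptions rather than how any particular library implements caching.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


class KVCache:
    """Hypothetical cache: one key and one value vector per processed token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def step_with_cache(x, Wq, Wk, Wv, cache):
    """Process only the current token x and reuse cached keys/values."""
    q = matvec(Wq, x)
    cache.append(matvec(Wk, x), matvec(Wv, x))  # one new K/V pair per step
    return attend(q, cache.keys, cache.values)


def step_without_cache(xs, Wq, Wk, Wv):
    """Baseline: recompute keys and values for the whole sequence xs."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    q = matvec(Wq, xs[-1])
    return attend(q, keys, values)
```

Calling `step_with_cache` once per token, in order, should give the same attention output as calling `step_without_cache` on the full prefix. The difference is that the cached version performs one key projection and one value projection per step, while the baseline repeats them for every earlier token.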

Updated 2026-05-02

Learn After

Space Complexity of the KV Cache
Updating the KV Cache
Two-Phase Inference from a KV Cache Perspective
Single-Step Generation with a KV Cache
Memory Allocation for KV Caching in Standard Self-Attention
Multi-Dimensional Structure of the KV Cache
An autoregressive language model generates text one word at a time. To generate the 100th word, it must relate it to all 99 previous words. A common optimization involves storing in memory the intermediate representations for each of the first 99 words as they are generated. Which statement best analyzes the primary computational advantage of this optimization compared to re-computing everything from scratch at step 100?
Chatbot Performance Degradation
Computational Steps in Cached Inference
Diagnosing and Redesigning KV-Cache Memory Behavior in a Multi-Tenant LLM Serving Stack
Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure
Evaluating a serving design that combines prefix caching with paged KV memory under mixed prompt lengths
Stabilizing latency and GPU memory in a chat-completions service with shared system prompts
Post-incident analysis: KV-cache growth, fragmentation, and shared-prefix reuse in a streaming LLM service
Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic
You run an internal LLM inference service for empl...
Your company’s internal LLM service handles many c...
You operate a GPU-backed LLM service that uses con...
You’re on-call for an internal LLM chat service. M...
