The Attention Bottleneck: How Modern LLMs Solved a Problem That Nearly Broke the Transformer
Every modern large language model — GPT-4, Llama 3, Gemini, Mistral — is a transformer. Every transformer is built around attention. But the original mechanism from “Attention Is All You Need” (Vaswani et al., 2017) cannot scale to the 100K+-token context windows these models now ship with. No GPU that exists today can run it, naively implemented, at 128K tokens.
The math makes the problem concrete. The attention matrix for a single layer has $n^2$ entries, where $n$ is the sequence length. At $n = 32{,}768$ tokens in FP16, that matrix occupies roughly 2 GB of GPU memory — for one layer. With 32 layers, attention matrices alone require 64 GB. The H100, the most powerful production GPU available, has 80 GB of HBM in total.
This post traces how the field solved that problem — not once, but four separate times, each addressing a different bottleneck. The variants covered here are not academic curiosities. They are prerequisites for every LLM running at scale today.
The Baseline: Scaled Dot-Product Attention
Before examining the variants, we need a precise definition of what they are varying from. Scaled dot-product attention takes three matrices as input — queries $Q$, keys $K$, and values $V$ — and produces a weighted sum of values:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
The $1/\sqrt{d_k}$ scaling prevents dot products from growing large in magnitude, which would push the softmax into regions with near-zero gradients.
Multi-head attention (MHA) runs $h$ independent attention computations in parallel, each projecting into a lower-dimensional subspace:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V)$$

where $W_i^Q, W_i^K \in \mathbb{R}^{d_\text{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_\text{model} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_\text{model}}$.
Complexity
Computing $QK^\top$ produces an $n \times n$ matrix. Time complexity is $O(n^2 d)$; memory to store the attention matrix is $O(n^2)$. This quadratic dependence on $n$ is the root cause of every problem in the sections that follow.
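As a reference point for everything that follows, here is a minimal single-head NumPy sketch of scaled dot-product attention (no masking, no batching):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) -- the quadratic object
    weights = softmax(scores, axis=-1)
    return weights @ V

n, d = 8, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (8, 16)
```

The `(n, n)` score matrix built on the marked line is exactly the object whose memory and bandwidth costs the rest of this post is about.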
The KV Cache
During autoregressive inference — generating one token at a time — the model must recompute keys and values for every previous token at each step unless they are cached. In practice, they are always cached: after computing $K_t$ and $V_t$ for position $t$, they are stored and reused for all future positions.
The cache size per transformer layer is:

$$\text{KV cache per layer} = 2 \times n_\text{heads} \times d_\text{head} \times n \times \text{bytes per value}$$

For a 70B-parameter model with $n_\text{heads} = 64$ heads, $d_\text{head} = 128$, FP16 (2 bytes), and $n = 32{,}768$:

$$2 \times 64 \times 128 \times 32{,}768 \times 2 \text{ bytes} \approx 1 \text{ GB per layer}$$

With 80 layers: 80 GB — the entire memory of an H100, before weights, activations, or any other state. The KV cache is the first wall.
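That arithmetic generalizes to any configuration; a small helper (the function name is hypothetical) reproduces the numbers above:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_val=2):
    """Total KV cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_val

# 70B-class model: 80 layers, 64 K/V heads, d_head = 128, 32K context, FP16
total = kv_cache_bytes(80, 64, 128, 32_768)
print(total / 2**30)  # 80.0 GiB
```

Swapping `n_kv_heads=64` for `n_kv_heads=8` — the GQA configuration discussed below — drops the same calculation to 10 GiB.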
Problem 1: The KV Cache Explodes
The KV cache grows with every attention head — and standard MHA has a lot of heads. The fix is to ask: do all those heads actually need their own keys and values?
Multi-Query Attention
Multi-Query Attention (MQA) (Shazeer, 2019) answers no. It keeps $h$ query heads but collapses keys and values to a single shared head:

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W^K,\, V W^V)$$

A single $W^K$ and $W^V$ replaces the $h$ separate projections. The KV cache shrinks by a factor of $h$. For $h = 64$, that is a 64× memory reduction at inference time.
The quality cost is real but small. Shazeer found perplexity increases of roughly 1–2% on language modeling tasks — acceptable for most applications, especially when the alternative is running out of memory.
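In shapes, the change is just broadcasting: one K/V pair serves all query heads. A minimal NumPy sketch (the helper name `mqa` is hypothetical):

```python
import numpy as np

def mqa(Q, K, V):
    """Q: (h, n, d); K, V: (n, d) -- one shared K/V head for all h query heads."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # (h, n, n): K broadcasts over heads
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V                              # (h, n, d)

h, n, d = 8, 16, 32
rng = np.random.default_rng(1)
Q = rng.standard_normal((h, n, d))
K = rng.standard_normal((n, d))   # cached: (n, d), not (h, n, d)
V = rng.standard_normal((n, d))
out = mqa(Q, K, V)
print(out.shape)  # (8, 16, 32)
```

The cache saving is visible in the signatures: `K` and `V` carry no head dimension at all.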
Group Query Attention
Group Query Attention (GQA) (Ainslie et al., 2023) generalizes MQA. Rather than collapsing to one K/V head, it creates $g$ groups. Each group of $h/g$ query heads shares one K/V head:

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\, K W_{g(i)}^K,\, V W_{g(i)}^V)$$

where $g(i)$ maps each query head to its group.

MHA is GQA with $g = h$. MQA is GQA with $g = 1$. GQA interpolates between them:
| Variant | K/V heads | Cache vs. MHA | Quality vs. MHA |
|---|---|---|---|
| MHA | $h$ | Baseline | Baseline |
| GQA ($g$ groups) | $g$ | $h/g\times$ smaller | Near-identical |
| MQA | $1$ | $h\times$ smaller | Small degradation |
Figure 2: How MHA, MQA, and GQA differ in their key/value head structure. Circles = query heads, rectangles = K/V head pairs.
Uptraining from MHA: To convert an existing MHA checkpoint to GQA, Ainslie et al. propose mean-pooling the K/V head projections within each group to initialize the shared GQA head, then continuing training for a short period. This avoids training GQA models from scratch.
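A sketch of that mean-pooling initialization, assuming the per-head K (or V) projection matrices are stacked in one tensor (names are illustrative, not from the paper's code):

```python
import numpy as np

def pool_kv_heads(W_heads, n_groups):
    """Initialize GQA K (or V) projections by mean-pooling MHA heads per group.

    W_heads: (h, d_model, d_head) -- one projection matrix per MHA head.
    Returns: (n_groups, d_model, d_head) -- one shared matrix per group.
    """
    h = W_heads.shape[0]
    assert h % n_groups == 0, "query heads must divide evenly into groups"
    grouped = W_heads.reshape(n_groups, h // n_groups, *W_heads.shape[1:])
    return grouped.mean(axis=1)

# 64 MHA heads pooled down to 8 GQA K/V heads
W = np.random.default_rng(2).standard_normal((64, 512, 128))
print(pool_kv_heads(W, 8).shape)  # (8, 512, 128)
```

After this initialization the model is uptrained briefly so the query heads adapt to their now-shared keys and values.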
GQA is now the default in most production LLMs: Llama 2 70B, Llama 3, Mistral 7B, and Gemma all use it.
Problem 2: The Sequence Length Wall
GQA and MQA reduce the KV cache. They do not reduce the cost of computing attention itself. The $QK^\top$ matrix is still $n \times n$. At $n = 100{,}000$ tokens, that is $10^{10}$ entries — approximately 20 GB at FP16, per layer, before any KV cache optimizations apply.
The question becomes: does every token need to attend to every other token?
Sparse Attention
Sparse Transformer (Child et al., 2019) applies an additive mask $M$ to restrict which positions attend to each other:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right) V$$

where $M_{ij}$ is zero at allowed positions and $-\infty$ at masked positions (which become zero after softmax). Three patterns proved most useful:

- Local window: $M_{ij} = 0$ iff $|i - j| \le w$. Each token attends to its $w$ nearest neighbors.
- Strided: $M_{ij} = 0$ iff $(i - j) \bmod s = 0$. Every $s$-th token is globally visible.
- Combined: local + strided, covering $O(n\sqrt{n})$ pairs instead of $O(n^2)$.
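The three patterns are easy to build as boolean masks (True = attention allowed); a NumPy sketch with illustrative window and stride values:

```python
import numpy as np

def local_mask(n, w):
    """True where |i - j| <= w: each token sees its w nearest neighbors."""
    i, j = np.indices((n, n))
    return np.abs(i - j) <= w

def strided_mask(n, s):
    """True where (i - j) is a multiple of the stride s."""
    i, j = np.indices((n, n))
    return (i - j) % s == 0

n, w, s = 8, 1, 3
combined = local_mask(n, w) | strided_mask(n, s)
print(combined.sum(), "allowed pairs out of", n * n)
```

Converting a boolean mask to the additive form above is one line: `np.where(mask, 0.0, -np.inf)`.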
Sliding Window Attention
Sliding Window Attention, used in Longformer (Beltagy et al., 2020) and Mistral 7B (Jiang et al., 2023), is the causal special case of local windowing: each token attends only to the $w$ most recent positions:

$$M_{ij} = 0 \iff 0 \le i - j < w$$

Complexity drops from $O(n^2)$ to $O(n \cdot w)$.
Effective receptive field across layers: Although each layer sees only a window of size $w$, information propagates across layers. A token at position $i$ can receive information from position $i - k \cdot w$ after $k$ layers. With $L$ layers stacked, the effective receptive field is $L \times w$.

Mistral 7B uses $w = 4096$ with $L = 32$ transformer layers:

$$32 \times 4096 = 131{,}072 \approx 128\text{K tokens}$$

This is why Mistral achieves strong long-context performance despite attending to a small local window per layer.
Figure 1: Attention patterns for an 8-token sequence. Dark cells indicate attended positions.
Problem 3: Even $O(n\sqrt{n})$ Has Limits
Sparse and sliding-window patterns cut the cost — to $O(n\sqrt{n})$ or $O(n \cdot w)$ — but for very long sequences, or tasks where important context is globally distributed, fixed sparsity patterns miss signal. The deeper question is: can we reduce the complexity of attention from $O(n^2)$ to $O(n)$ without hard-coding which positions matter?
The Kernel Decomposition
The obstacle is the softmax. Written out explicitly, the $i$-th output of attention is:

$$\mathrm{Attn}(Q, K, V)_i = \frac{\sum_{j=1}^{n} \exp\!\left(q_i^\top k_j / \sqrt{d_k}\right) v_j}{\sum_{j=1}^{n} \exp\!\left(q_i^\top k_j / \sqrt{d_k}\right)}$$

The denominator sums over all $j$ — the positions are coupled. You cannot compute any output without touching every key.
Linear attention replaces the exponential kernel with a decomposable kernel $\mathrm{sim}(q, k) = \phi(q)^\top \phi(k)$ for some feature map $\phi$:

$$\mathrm{Attn}_i = \frac{\sum_{j} \phi(q_i)^\top \phi(k_j)\, v_j}{\sum_{j} \phi(q_i)^\top \phi(k_j)}$$

Now factor the computation. Define:

$$S = \sum_{j} \phi(k_j)\, v_j^\top, \qquad z = \sum_{j} \phi(k_j)$$

Compute $S$ and $z$ once in $O(n d^2)$ time. Then each query is:

$$\mathrm{Attn}_i = \frac{\phi(q_i)^\top S}{\phi(q_i)^\top z}$$

in $O(d^2)$ time. Total: $O(n d^2)$ — linear in sequence length.
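The factored computation fits in a few lines of NumPy (non-causal case; the feature map here is an arbitrary positive one for illustration, not any specific paper's choice):

```python
import numpy as np

def linear_attention(Q, K, V, phi):
    """O(n d^2) attention: phi(Q) (phi(K)^T V) normalized by phi(Q) (sum phi(K))."""
    Qf, Kf = phi(Q), phi(K)                 # (n, d_f) feature-mapped queries/keys
    S = Kf.T @ V                            # (d_f, d_v): sum_j phi(k_j) v_j^T
    z = Kf.sum(axis=0)                      # (d_f,):    sum_j phi(k_j)
    return (Qf @ S) / (Qf @ z)[:, None]     # each row divided by its denominator

phi = lambda x: np.maximum(x, 0) + 1e-6    # any strictly positive map works here
n, d = 16, 8
rng = np.random.default_rng(3)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V, phi)
print(out.shape)  # (16, 8)
```

Note that the $n \times n$ similarity matrix is never formed: `S` and `z` are the only intermediates, and both are independent of sequence length.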
Linear Transformer
Linear Transformer (Katharopoulos et al., 2020) uses $\phi(x) = \mathrm{elu}(x) + 1$, which ensures positivity (required for the kernel interpretation). The causal variant accumulates $S$ and $z$ as prefix sums, making it equivalent to an RNN — enabling $O(1)$ per-step inference once the recurrence is unrolled.
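The RNN view can be made concrete: the sums become running state updated once per token. A sketch, with elu(x) + 1 written out explicitly:

```python
import numpy as np

def causal_linear_attention(Q, K, V, phi):
    """Each position i sees only keys/values j <= i; state is two prefix sums."""
    n, d_v = V.shape
    d_f = phi(Q[0]).shape[0]
    S = np.zeros((d_f, d_v))               # running sum_j phi(k_j) v_j^T
    z = np.zeros(d_f)                      # running sum_j phi(k_j)
    out = np.empty((n, d_v))
    for i in range(n):                     # one RNN step per token
        kf = phi(K[i])
        S += np.outer(kf, V[i])
        z += kf
        qf = phi(Q[i])
        out[i] = (qf @ S) / (qf @ z)
    return out

phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
rng = np.random.default_rng(4)
Q, K, V = (rng.standard_normal((12, 8)) for _ in range(3))
out = causal_linear_attention(Q, K, V, phi)
```

At inference time only `S` and `z` need to be kept between steps — constant memory per layer, regardless of how many tokens have been generated.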
Performer (FAVOR+)
Rather than replacing softmax with an arbitrary kernel, Performer (Choromanski et al., 2020) approximates the softmax kernel itself using random features (FAVOR+: Fast Attention Via positive Orthogonal Random features):

$$\exp(q^\top k) \approx \phi(q)^\top \phi(k), \qquad \phi(x) = \frac{e^{-\|x\|^2 / 2}}{\sqrt{m}} \left[ e^{\omega_1^\top x}, \dots, e^{\omega_m^\top x} \right]^\top$$

where the random directions $\omega_1, \dots, \omega_m$ are drawn as orthogonal Gaussian vectors. Orthogonality reduces the estimator's variance compared to i.i.d. sampling.
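The estimator is easy to sanity-check numerically. The sketch below uses i.i.d. Gaussian directions rather than the orthogonal blocks FAVOR+ prescribes, so it demonstrates only the unbiased approximation, not the variance reduction:

```python
import numpy as np

def softmax_kernel_features(x, omega):
    """Positive random features: phi(x)_i = exp(w_i.x - |x|^2 / 2) / sqrt(m)."""
    m = omega.shape[0]
    return np.exp(omega @ x - x @ x / 2) / np.sqrt(m)

rng = np.random.default_rng(5)
d, m = 4, 100_000
omega = rng.standard_normal((m, d))        # i.i.d. directions (paper: orthogonal)
q = rng.standard_normal(d) * 0.4
k = rng.standard_normal(d) * 0.4
estimate = softmax_kernel_features(q, omega) @ softmax_kernel_features(k, omega)
exact = np.exp(q @ k)
print(estimate, "vs", exact)               # close for large m
```

In a real Performer, $m$ is a few hundred, not $10^5$; the point of orthogonal sampling is precisely to get acceptable variance at small $m$.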
Quality Trade-off
Linear attention approximates softmax — it loses the sharp, peaked attention distributions that standard attention learns. For tasks requiring precise token recall (e.g. copying a specific value from earlier in the context), the approximation gap is measurable. For tasks that aggregate information over long spans, linear attention is often competitive with standard attention at a fraction of the compute.
Problem 4: The GPU I/O Wall
By 2022, practitioners had sparse and linear attention. Yet profiling showed standard attention was still slow. The GPU’s tensor cores were sitting idle. The bottleneck was not arithmetic — it was memory bandwidth. Moving data between memory tiers dominated runtime.
The GPU Memory Hierarchy
Modern GPUs have two relevant memory tiers:
- SRAM (shared memory, on-chip): ~20 MB on an A100, bandwidth ~19 TB/s
- HBM (high-bandwidth memory, off-chip): 40–80 GB on an A100, bandwidth ~2 TB/s
SRAM is roughly 10× faster than HBM but 2,000× smaller. Standard attention reads $Q$, $K$, $V$ from HBM and computes $S = QK^\top$. It writes the result to HBM, reads it back for the softmax, then reads it again for the multiplication by $V$. That is three round-trips over $O(n^2)$ data — dominated by HBM bandwidth, not arithmetic.
Flash Attention
Flash Attention (Dao et al., 2022) achieves the same mathematical output as standard attention while never materializing the full $n \times n$ matrix in HBM. It does this with three ideas:

1. Tiling. Partition $Q$, $K$, $V$ into blocks small enough to fit in SRAM. Process one block at a time, keeping all intermediate values on-chip.
2. Online softmax. Softmax over a full row requires seeing all $n$ scores first. The online softmax algorithm computes a numerically stable result using running statistics: a running maximum $m$, a running denominator $\ell$, and a running unnormalized output $o$. For each new block of key-value pairs with scores $s_j$ and values $v_j$, update:

$$m^{\text{new}} = \max\!\left(m, \max_j s_j\right), \qquad \ell \leftarrow e^{m - m^{\text{new}}} \ell + \sum_j e^{s_j - m^{\text{new}}}, \qquad o \leftarrow e^{m - m^{\text{new}}} o + \sum_j e^{s_j - m^{\text{new}}} v_j$$

After all blocks: $\text{output} = o / \ell$.

This produces the exact same result as computing softmax over all scores at once.
3. Recomputation. The backward pass normally needs the attention matrix to compute gradients. Flash Attention discards it and recomputes from the saved output and softmax statistics during backprop. This trades extra FLOPs for drastically less HBM traffic.
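The online softmax (idea 2) can be verified directly against the full computation — a sketch for a single query row, with hypothetical names:

```python
import numpy as np

def online_softmax_row(scores, values, block=4):
    """One query row of attention, processed blockwise with running statistics."""
    m, ell = -np.inf, 0.0                  # running max and running denominator
    o = np.zeros(values.shape[-1])         # running unnormalized output
    for start in range(0, len(scores), block):
        s = scores[start:start + block]
        v = values[start:start + block]
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)          # rescale old stats to the new max
        p = np.exp(s - m_new)
        ell = scale * ell + p.sum()
        o = scale * o + p @ v
        m = m_new
    return o / ell

rng = np.random.default_rng(6)
scores = rng.standard_normal(16)           # q.K^T / sqrt(d) for one query
values = rng.standard_normal((16, 8))
out = online_softmax_row(scores, values)
```

No array of length 16 ever needs to exist in full here beyond the inputs — each block's scores are consumed and discarded, which is what lets Flash Attention keep everything in SRAM.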
Result: IO complexity drops from $\Theta(nd + n^2)$ to $\Theta(n^2 d^2 / M)$, where $M$ is the SRAM size. On an A100:
- 2–4× wall-clock speedup over PyTorch standard attention
- 5–20× reduction in GPU memory usage for the attention operation
The mathematical output is identical to standard attention — Flash Attention is exact, not an approximation; any differences come only from floating-point summation order.
Figure 3: Standard attention makes 3 round-trips over the n×n matrix. Flash Attention keeps tiles in SRAM and writes only the output to HBM.
Subsequent Versions
- Flash Attention 2 (Dao, 2023): restructures work partitioning across GPU warps to reduce non-matmul FLOPs and improve parallelism. Roughly 2× faster than FA1.
- Flash Attention 3 (Shah et al., 2024): targets the H100’s Hopper architecture specifically — uses warp-specialized pipelines, asynchronous memory copies, and FP8 precision. Reaches roughly 75% utilization of the H100’s theoretical peak FLOPS in FP16, and higher absolute throughput still in FP8.
How Modern LLMs Combine These
No single variant won. Production models stack them because the bottlenecks they address are independent:
| Model | KV Reduction | Long Context | IO Efficiency |
|---|---|---|---|
| GPT-3 (2020) | MHA — none | — | Standard |
| PaLM (2022) | MQA | — | Standard |
| Llama 2 70B (2023) | GQA | — | Flash Attention 2 |
| Mistral 7B (2023) | GQA | Sliding Window ($w = 4096$) | Flash Attention 2 |
| Llama 3 8B/70B (2024) | GQA | — | Flash Attention 2 |
| Gemma 7B (2024) | GQA | — | Flash Attention 2 |
| Qwen2 7B/72B (2024) | GQA | — | Flash Attention 2 |
| Qwen2.5 72B (2024) | GQA | 128K (YaRN) | Flash Attention 2 |
| DeepSeek-V2 (2024) | MLA | 128K | Flash Attention 2 |
| DeepSeek-V3 (2024) | MLA | 128K | Flash Attention 2 |
| Kimi k1.5 (2025) | MHA | 128K | Flash Attention 2 |
| Kimi K2 (2025) | MLA | 128K | Flash Attention 2 |
The combinations are not arbitrary. Each technique addresses a different bottleneck:
- GQA attacks the KV cache — the memory cost per sequence during inference.
- Sliding Window Attention attacks the per-layer compute cost for long inputs.
- Flash Attention attacks GPU memory bandwidth — it applies regardless of which attention variant you use.
- MLA attacks the KV cache more aggressively than GQA, using low-rank compression instead of head-sharing.
A New Direction: Multi-head Latent Attention (MLA)
GQA reduces the KV cache by sharing key/value heads across query groups — but the cache still grows linearly with sequence length. DeepSeek-V2 (Liu et al., 2024) introduced a different approach: compress the KV cache into a low-dimensional latent vector and store that instead.
The core idea: instead of caching full $K$ and $V$ matrices, compute a compressed latent $c_t$ of dimension $d_c$, where $d_c \ll n_\text{heads} \cdot d_\text{head}$:

$$c_t = W^{DKV} h_t$$

At inference, decompress on the fly:

$$k_t = W^{UK} c_t, \qquad v_t = W^{UV} c_t$$

Only $c_t$ is cached. For DeepSeek-V2, this reduces the KV cache per token by 93.3% compared to standard MHA. GQA with 8 groups achieves roughly an 8× reduction; MLA achieves roughly a 57–93× reduction depending on configuration.
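A shape-level sketch of the compress/decompress path for one token (dimensions are illustrative, not DeepSeek-V2's actual sizes, and the decoupled-RoPE path is omitted):

```python
import numpy as np

d_model, d_c, n_heads, d_head = 1024, 64, 16, 64
rng = np.random.default_rng(7)
W_dkv = rng.standard_normal((d_c, d_model)) * 0.02           # down-projection
W_uk = rng.standard_normal((n_heads * d_head, d_c)) * 0.02   # up-projection, keys
W_uv = rng.standard_normal((n_heads * d_head, d_c)) * 0.02   # up-projection, values

h_t = rng.standard_normal(d_model)   # hidden state for one token
c_t = W_dkv @ h_t                    # cache this: d_c floats per token per layer
k_t = W_uk @ c_t                     # decompressed at attention time
v_t = W_uv @ c_t                     # then split into n_heads heads of d_head

mha_cache = 2 * n_heads * d_head     # MHA caches K and V in full
print(d_c, "cached floats vs", mha_cache, "for MHA")  # 64 vs 2048: 32x smaller
```

In practice the up-projections can also be folded into the query and output projections, so the decompression need not be materialized at all — one reason MLA stays fast despite the extra matrices.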
One complication: Rotary Position Embeddings (RoPE) cannot be applied to compressed keys — position information is entangled with the compression. DeepSeek-V2 solves this with decoupled RoPE: a separate set of query and key vectors carries position, applied independently before the attention dot product. The compressed KV path handles content; the RoPE path handles position.
MLA is compatible with Flash Attention and can be used alongside sliding window patterns. DeepSeek-V3 and Kimi K2 both adopted MLA directly, making it the dominant KV cache strategy for frontier MoE models as of 2025.
Qwen2 and Long-Context Extrapolation
Qwen2 uses GQA across all model sizes and adds Dual Chunk Attention (DCA) for long-context inference. DCA splits the sequence into fixed-size chunks and uses three query types per token: one for attending within the same chunk, one for attending to the preceding chunk, and one for attending to all other chunks with relative position clamped. This allows Qwen2 to extrapolate from a 32K training context to 128K at inference without retraining, using position interpolation (YaRN) alongside DCA. Qwen2.5 extends the same approach to 1M tokens in the Turbo variant, adding sparse attention patterns based on MInference.
Decision Framework
If you are building or fine-tuning a model:
- Hitting KV cache limits during inference → add GQA (simpler, widely supported) or MLA (higher compression, more implementation complexity)
- Hitting sequence length limits → add sliding window or linear attention
- Hitting training throughput → enable Flash Attention (always do this regardless of attention variant)
- Need ultra-long context without retraining → combine YaRN position scaling with DCA or sparse attention patterns
What this post does not cover: state-space models (Mamba, S4) and hybrid architectures (Jamba, Zamba) represent a different branch of the scaling tree — replacing attention with selective state transitions rather than approximating it. That is a separate post.
References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017.
- Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150.
- Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., & Sanghai, S. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP 2023.
- Child, R., Gray, S., Radford, A., & Sutskever, I. (2019). Generating Long Sequences with Sparse Transformers. arXiv:1904.10509.
- Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv:2004.05150.
- Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., & El Sayed, W. (2023). Mistral 7B. arXiv:2310.06825.
- Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. ICML 2020.
- Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, Ł., Belanger, D., Colwell, L., & Weller, A. (2020). Rethinking Attention with Performers. ICLR 2021.
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.
- Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. ICLR 2024.
- Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., & Dao, T. (2024). FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. arXiv:2407.08608.
- DeepSeek-AI, Liu, A., Feng, B., et al. (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434.
- DeepSeek-AI, Liu, A., Feng, B., et al. (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437.
- Qwen Team, Alibaba Group. (2024). Qwen2 Technical Report. arXiv:2407.10671.
- Qwen Team, Alibaba Group. (2024). Qwen2.5 Technical Report. arXiv:2412.15115.
- Kimi Team, Moonshot AI. (2025). Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv:2501.12599.