Tags › #transformers (2 posts)
-
From Quadratic to Linear: A Survey of Subquadratic Sparse Attention
Why standard attention breaks down at 128K tokens, how four families of efficient attention tried, and only partially managed, to fix it, and how content-dependent sparse routing achieves linear scaling without sacrificing retrieval accuracy.
-
The Attention Bottleneck: How Modern LLMs Solved a Problem That Nearly Broke the Transformer
From vanilla multi-head attention to Flash Attention 3 — the engineering bottlenecks that drove every major attention variant and the math behind each fix.