Dynamic sparse attention (DSA) reduces per-token attention bandwidth by restricting computation to a top-k subset of cached key-value (KV) entries, but its token-dependent selection pattern introduces a system-level challenge: the KV working set is fragmented, volatile, and difficult to prefetch, which can translate into poor cache locality and stalled decode throughput. We study these effects by implementing a lightweight indexer for DSA-style selection on multiple open-source backbones and logging per-layer KV indices during autoregressive decoding. Our analysis reveals a gap in serving DSA backbones: a potentially high volume of blocking last-level (LL) cache misses that degrades efficiency. We propose a novel LL-cache reservation system that retains KV tokens in the LL cache between decode steps, combined with a token-granularity LRU eviction policy, and show on our collected traces how this architecture benefits serving across backbones with DSA implemented. Finally, we propose directions for future architectural and algorithmic exploration to improve DSA serving on modern inference platforms.
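The token-granularity LRU eviction policy described above can be sketched as follows. This is a minimal illustrative model, not the paper's implementation: the class name, the `capacity` parameter (modeling how many KV tokens the LL-cache reservation can hold), and the miss-counting interface are all assumptions for exposition.

```python
from collections import OrderedDict

class TokenLRUReservation:
    """Hypothetical sketch: token-granularity LRU over KV tokens reserved
    in the last-level cache between decode steps."""

    def __init__(self, capacity: int):
        self.capacity = capacity          # number of KV tokens the reservation can hold
        self.resident = OrderedDict()     # token_id -> True, ordered oldest -> newest

    def access(self, token_ids):
        """Record one decode step's top-k KV selection; return the tokens that miss."""
        misses = []
        for t in token_ids:
            if t in self.resident:
                self.resident.move_to_end(t)        # hit: refresh recency
            else:
                misses.append(t)                    # miss: token must be fetched
                self.resident[t] = True
                if len(self.resident) > self.capacity:
                    self.resident.popitem(last=False)  # evict least recently used token
        return misses
```

Replaying logged per-layer KV indices through such a model gives a miss count per decode step, which is one way to estimate how much blocking LL-cache traffic a reservation policy would avoid.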