StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

From arXiv: 11 pages, 3 figures, 7 tables, 2 algorithms, 36 references. Memory-bounded indexer kernel for DeepSeek-V4 CSA via chunked partition-merge top-k. Code: https://github.com/RightNow-AI/StreamIndex

DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection) scores the compressed keys, the top-k keys are selected per query, and a sparse attention kernel reads only those. Public CSA implementations materialize a [B, S, H_I, T] FP32 score tensor before the top-k reduction. With H_I=64 indexer heads and the V4-Flash compression ratio m=4, that intermediate is 256 GB at sequence length S=65,536, exceeding any single-GPU high-bandwidth-memory (HBM) budget. We present StreamIndex, a Triton implementation of the CSA pipeline whose central component is a chunked partition-merge top-k driver that never materializes the full intermediate. On synthetic-but-realistic V4-shaped inputs, measured at the level of a single indexer step (one layer) on a single NVIDIA H200, the materialize path runs out of memory (OOMs) at S=65,536 with V4-Flash dimensions; StreamIndex runs the same indexer up to S=1,048,576 with 6.21 GB peak HBM, a 32x extension of the feasible sequence-length regime. Set-overlap recall against the materialize ground truth is bit-exact at small S, where both paths fit; across three 5-point design-space sweeps (chunk size, key-tile size, top-k), mean recall rounds to 1.0000 and minimum recall is at least 0.9980 in every cell. The chunked driver composes with TileLang's pipelined attention kernel: at S=262,144 with V4-Flash dimensions, the materialize indexer paired with TileLang attention OOMs, while the chunked indexer paired with the same attention kernel runs in 1.97 s at 18.56 GB peak. Our contribution targets the indexer step; we make no claim of a faster attention kernel or of real-checkpoint end-to-end behavior. Code: https://github.com/RightNow-AI/StreamIndex.
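The two ideas in the abstract, the size of the materialized score tensor and the chunked partition-merge top-k, can be illustrated with a minimal pure-Python sketch. This is not the StreamIndex Triton kernel: the function names (`score_tensor_bytes`, `materialized_topk`, `chunked_topk`) are hypothetical, the sketch handles one query at a time on the CPU, and it assumes batch size B=1 and a compressed key length of T = S/m, neither of which the abstract states explicitly. Under those assumptions the tensor occupies 2^38 bytes, which matches the quoted 256 GB if GB is read as 2^30 bytes.

```python
import heapq
import random


def score_tensor_bytes(batch, seq_len, indexer_heads, ratio, bytes_per_elem=4):
    """Bytes of a materialized [B, S, H_I, T] score tensor.

    Assumption (not stated in the abstract): T = seq_len // ratio.
    """
    compressed_len = seq_len // ratio
    return batch * seq_len * indexer_heads * compressed_len * bytes_per_elem


def materialized_topk(scores, k):
    """Reference path: requires the full score row for one query in memory."""
    best = heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))
    return {idx for _, idx in best}


def chunked_topk(score_chunks, k):
    """Streaming path: holds at most k candidates between chunks.

    score_chunks yields lists of scores for consecutive key positions,
    so the full row never has to exist at once.
    """
    best = []    # running top-k candidates as (score, key_index)
    offset = 0   # global index of the first key in the current chunk
    for chunk in score_chunks:
        # Partition step: top-k restricted to this chunk.
        local = heapq.nlargest(k, ((s, offset + i) for i, s in enumerate(chunk)))
        # Merge step: combine with the running candidates, keep the top-k.
        best = heapq.nlargest(k, best + local)
        offset += len(chunk)
    return {idx for _, idx in best}


if __name__ == "__main__":
    random.seed(0)
    scores = [random.random() for _ in range(1000)]
    chunks = (scores[i:i + 128] for i in range(0, len(scores), 128))
    # Both paths select the same key set on this toy input.
    assert chunked_topk(chunks, 16) == materialized_topk(scores, 16)
    # V4-Flash-like shape from the abstract, assuming B=1 and T=S/m.
    print(score_tensor_bytes(1, 65_536, 64, 4) / 2**30)  # 256.0
```

Because every global top-k element is also in the top-k of its own chunk, merging per-chunk winners reproduces the full-row result exactly in this sketch, while working memory stays proportional to the chunk size plus k. Any recall below 1.0 reported for the real kernel would come from implementation details the sketch does not model.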
