DeepSeek-V3.2 and V4 introduce Compressed Sparse Attention (CSA): a lightning indexer (a learned scoring projection) scores the compressed keys, the top-k keys are selected per query, and a sparse attention kernel reads only those. Public CSA implementations materialize a [B, S, H_I, T] FP32 score tensor before the top-k reduction. With H_I=64 indexer heads and the V4-Flash compression ratio m=4, that intermediate is 256 GB at sequence length S=65,536, exceeding any single-GPU high-bandwidth-memory (HBM) budget. We present StreamIndex, a Triton implementation of the CSA pipeline whose central component is a chunked partition-merge top-k driver that never materializes the full intermediate. On synthetic-but-realistic V4-shaped inputs at the indexer-step (layer) level on a single NVIDIA H200, the materialize path runs out of memory (OOMs) at S=65,536 with V4-Flash dimensions; StreamIndex runs the same indexer to S=1,048,576 with 6.21 GB peak HBM, a 32x regime extension. Set-overlap recall against the materialize ground truth is bit-exact at small S where both fit; across three 5-point design-space sweeps (chunk size, key-tile size, top-k), mean recall rounds to 1.0000 with min recall at least 0.9980 in every cell. The chunked driver composes with TileLang's pipelined attention kernel: at S=262,144 with V4-Flash dimensions, the materialize indexer paired with TileLang attention OOMs, while the chunked indexer paired with the same attention runs in 1.97 s at 18.56 GB peak. Our contribution targets the indexer step; we make no claim of a faster attention kernel or of real-checkpoint end-to-end behavior. Code: https://github.com/RightNow-AI/StreamIndex.
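To make the memory arithmetic and the partition-merge idea concrete, the following is a minimal pure-Python sketch, not the StreamIndex Triton kernel. It assumes batch size B=1, T = S / m compressed keys, and "GB" read as 2^30 bytes, which is what reproduces the 256 GB figure. The names `materialized_score_bytes`, `chunked_topk`, `materialized_topk`, and the per-key `score_fn` callback are hypothetical and introduced only for illustration; the real indexer's scoring function is not modeled here.

```python
import heapq
import random


def materialized_score_bytes(batch, seq_len, indexer_heads, ratio, bytes_per_elem=4):
    """Bytes in a [B, S, H_I, T] score tensor, assuming T = seq_len // ratio."""
    num_keys = seq_len // ratio
    return batch * seq_len * indexer_heads * num_keys * bytes_per_elem


def materialized_topk(score_fn, num_keys, k):
    """Reference path: score every key first, then reduce to the top-k indices."""
    scores = [(score_fn(j), j) for j in range(num_keys)]
    return sorted(j for _, j in heapq.nlargest(k, scores))


def chunked_topk(score_fn, num_keys, k, chunk_size):
    """Top-k key indices for one query, scoring one chunk of keys at a time.

    At most chunk_size scores plus k running candidates are alive at once,
    so the full score row is never held in memory.
    """
    best = []  # running candidates as (score, key_index) pairs
    for start in range(0, num_keys, chunk_size):
        stop = min(start + chunk_size, num_keys)
        chunk = [(score_fn(j), j) for j in range(start, stop)]
        local = heapq.nlargest(k, chunk)        # partition: local top-k
        best = heapq.nlargest(k, best + local)  # merge: fold into running top-k
    return sorted(j for _, j in best)


# Abstract's configuration: S=65,536, H_I=64, m=4, FP32, assumed B=1.
print(materialized_score_bytes(1, 65_536, 64, 4) / 2**30)  # 256.0

# Synthetic scores for a single query (hypothetical stand-in for the indexer).
random.seed(0)
demo_scores = [random.random() for _ in range(1000)]
same = chunked_topk(demo_scores.__getitem__, 1000, 16, 128) == materialized_topk(
    demo_scores.__getitem__, 1000, 16
)
print(same)
```

Because each `(score, index)` pair is unique, the ordering is total and the chunked merge selects exactly the same set as the materialized reduction in this sketch. It says nothing about the recall behavior of the actual kernels, which is what the sweeps reported above measure.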