Speculative decoding and dynamic sparse attention are two complementary approaches for accelerating long-context LLM inference: the former amortizes target-model execution across multiple verifier queries, while the latter reduces each query's KV-cache working set. Directly combining them, however, exposes a structural mismatch: speculative verification relies on cross-query commonality, whereas dynamic sparse attention assigns query-specific sparse layouts. This mismatch limits KV-block reuse, amplifies the branch-wise overheads of Native Sparse Attention (NSA), and makes the choice of verification strategy input- and regime-dependent. We present SpecSA, a sparse speculative-verification framework that turns dynamic sparse attention into a verification-oriented workload. SpecSA combines overlap-aware grouped-query execution, refresh/reuse-based NSA kernel fusion, and profile-guided prompt-adaptive orchestration to improve cross-query reuse, reduce selected-index and branch-fusion overheads, and select effective draft-verification strategies under user-specified precision classes. Experiments on NVIDIA H100 GPUs show that SpecSA achieves up to 3.49x end-to-end throughput improvement over autoregressive NSA decoding and up to 6.86x kernel speedup for sparse speculative verification.