As Large Language Models (LLMs) scale to longer context windows, the computational cost of attention mechanisms, which traditionally grows quadratically with input length, presents a critical challenge for real-time and memory-constrained deployments. Existing sparse attention techniques have sought to reduce this complexity, but they often incur significant overhead or compromise accuracy, making them less practical for large contexts on mid-range hardware. In this paper, we introduce SparseAccelerate, a dynamic sparse attention method that adapts its sparsity patterns based on input characteristics, effectively flattening the attention complexity curve. Our approach is effective for input lengths starting at 16K tokens and scales efficiently up to 128K tokens on dual NVIDIA A5000 GPUs (24GB each). Experimental results show that SparseAccelerate achieves up to a 1.04x reduction in Time-To-First-Token (TTFT) latency at 32K tokens, while also providing substantial memory savings. These improvements yield practical gains for memory-intensive applications and long-context tasks that were previously infeasible with standard attention. Beyond latency reductions, SparseAccelerate fundamentally shifts the scaling trend, demonstrating the smallest TTFT growth gradient relative to context length among competing methods. Ongoing evaluations on diverse benchmarks confirm its scalability, positioning SparseAccelerate as a critical advancement toward efficient, real-time, and large-context LLM inference on accessible hardware.