The quadratic cost of attention hinders the scalability of long-context LLMs, especially in resource-constrained settings. Existing static sparse methods, such as sliding windows or global tokens, exploit the sparsity of attention to reduce its cost, but adapt poorly to content-dependent variations in attention because their patterns are fixed in advance. While previous work has proposed dynamic approaches to improve flexibility, these still depend on predefined templates or heuristic mechanisms. Such strategies reduce generality and can prune tokens that remain contextually important, limiting accuracy across diverse tasks. To address these bottlenecks in long-context modeling, we introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that predicts attention sparsity online without retraining. DHSA adaptively segments sequences into variable-length chunks, then computes chunk representations by aggregating the token embeddings within each chunk. To avoid the bias introduced by varying chunk lengths, we apply length-normalized aggregation that scales the averaged embeddings by the square root of the chunk size. Finally, DHSA upsamples chunk-level similarity scores to token-level similarities to compute importance scores that determine which token-level interactions to preserve. Experiments on Gemma2 with the Needle-in-a-Haystack Test and LongBench show that DHSA matches dense attention in accuracy while reducing prefill latency by 20-60% and peak memory usage by 35%. Compared with representative baselines such as block sparse attention, DHSA achieves consistently higher accuracy (6-18% relative gains) at comparable or lower cost, offering an efficient and adaptable solution for long-context on-device LLMs.
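The length-normalized aggregation and chunk-to-token upsampling described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names (`chunk_representations`, `token_importance`) are hypothetical, and the fixed chunk boundaries stand in for DHSA's adaptive, data-driven segmentation.

```python
import numpy as np

def chunk_representations(token_embs, boundaries):
    """Aggregate token embeddings per chunk with a length-normalized mean:
    the averaged embedding is scaled by sqrt(chunk size) to offset the
    bias introduced by varying chunk lengths."""
    reps = []
    for start, end in boundaries:
        chunk = token_embs[start:end]            # (chunk_len, d)
        reps.append(chunk.mean(axis=0) * np.sqrt(end - start))
    return np.stack(reps)                        # (num_chunks, d)

def token_importance(token_embs, boundaries):
    """Compute chunk-level similarity scores, then upsample them back to
    token-level importance scores by broadcasting each chunk-pair score
    to every token pair drawn from those two chunks."""
    reps = chunk_representations(token_embs, boundaries)
    sim = reps @ reps.T                          # (num_chunks, num_chunks)
    # Map each token position to its chunk index.
    chunk_of = np.concatenate(
        [np.full(end - start, i) for i, (start, end) in enumerate(boundaries)])
    # Upsample: score for tokens (i, j) = similarity of their chunks.
    return sim[np.ix_(chunk_of, chunk_of)]       # (num_tokens, num_tokens)
```

In a full pipeline, the resulting token-level scores would be thresholded or top-k filtered to select which token-level attention interactions to keep; that selection step is omitted here.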