Transformers have become the foundation of numerous state-of-the-art AI models across diverse domains, thanks to their powerful attention mechanism for modeling long-range dependencies. However, attention's quadratic complexity with respect to sequence length poses significant challenges for efficient hardware implementation. While techniques such as quantization and pruning help mitigate this issue, selective token attention offers a promising alternative: it narrows the attention scope to only the most relevant tokens, reducing computation and filtering out noise. In this work, we propose SATA, a locality-centric dynamic scheduling scheme that proactively manages the sparsely distributed access patterns arising from selective Query-Key operations. By reordering operand flow and exploiting data locality, our approach enables early fetch and early retirement of intermediate Query/Key vectors, improving system utilization. We implement and evaluate our token management strategy in a control and compute system, using runtime traces from selective-attention-based models. Experimental results show that our method improves system throughput by up to 1.76x and boosts energy efficiency by up to 2.94x, while incurring minimal scheduling overhead.
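To make the selective-attention idea concrete, the following is a minimal sketch of top-k selective token attention in NumPy, where each query attends only to its `top_k` highest-scoring keys. The function name and the `top_k` budget are illustrative assumptions; this is not the SATA scheduler itself, which manages operand flow and locality on top of such selective Query-Key operations.

```python
# Hedged sketch of selective (top-k) token attention; illustrative only,
# not the paper's SATA scheduling scheme.
import numpy as np

def selective_attention(Q, K, V, top_k):
    """For each query, attend only to the top_k highest-scoring keys."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # (n_q, n_k) scaled QK scores
    # Keep only the top_k keys per query; mask out the rest with -inf.
    idx = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, idx, 0.0, axis=-1)
    masked = scores + mask
    # Softmax over the surviving keys only; masked entries contribute zero.
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                              # (n_q, d) attended output

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 8, 4))      # 8 tokens, head dim 4
out = selective_attention(Q, K, V, top_k=2)
```

Because each query row touches only `top_k` key/value vectors, the resulting access pattern is sparse and irregular, which is exactly the kind of pattern a locality-centric scheduler must reorder and batch for efficient hardware execution.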