Attention-based models have revolutionized AI, but the quadratic cost of self-attention incurs severe computational and memory overhead. Sparse attention methods alleviate this by skipping low-relevance token pairs. However, current approaches lack practicality due to the heavy expense of the added sparsity predictor, which severely degrades their hardware efficiency. This paper advances the state-of-the-art (SOTA) by proposing a bit-serial-enabled stage-fusion (BSF) mechanism that eliminates the need for a separate predictor. However, BSF faces three key challenges: 1) inaccurate bit-sliced sparsity speculation leads to incorrect pruning; 2) fine-grained and imbalanced bit-level workloads cause hardware under-utilization; 3) the row-wise dependency in the sparsity pruning criterion makes tiling difficult. We propose PADE, a predictor-free algorithm-hardware co-design for dynamic sparse attention acceleration. PADE features three key innovations: 1) a bit-wise uncertainty interval-enabled guard filtering (BUI-GF) strategy that accurately identifies trivial tokens during each bit round; 2) bidirectional sparsity-based out-of-order execution (BS-OOE) that improves hardware utilization; 3) interleaving-based sparsity-tiled attention (ISTA) that reduces both I/O and computational complexity. These techniques, combined with custom accelerator designs, enable practical sparsity acceleration without relying on an added sparsity predictor. Extensive experiments on 22 benchmarks show that PADE achieves a 7.43x speedup and 31.1x higher energy efficiency than the NVIDIA H100 GPU. Compared to SOTA accelerators, PADE achieves 5.1x, 4.3x, and 3.4x energy savings over Sanger, DOTA, and SOFA, respectively.
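To make the bit-serial filtering idea concrete, the sketch below illustrates the general principle behind uncertainty interval-based guard filtering, not the paper's actual BUI-GF algorithm or criterion: query-key scores are accumulated bit-plane by bit-plane (MSB first), each partial score carries an interval bounding the contribution of the bits not yet processed, and keys whose interval can no longer reach the strongest surviving score are pruned early. The function name, the margin heuristic, and the pruning rule are hypothetical placeholders chosen for clarity.

```python
import numpy as np

def bit_serial_guard_filter(q, K, bits=8, margin=None):
    """Illustrative sketch (assumed, not the paper's exact BUI-GF):
    score keys against a query bit-plane by bit-plane (MSB first),
    track an uncertainty interval [lower, upper] per partial score,
    and prune keys whose interval falls below the best surviving score.

    q : (d,)   non-negative integer query, values in [0, 2**bits)
    K : (n, d) non-negative integer keys,  values in [0, 2**bits)
    """
    n, d = K.shape
    if margin is None:
        # Hypothetical, data-dependent slack so the sketch prunes conservatively.
        margin = int(q.sum()) * (2 ** (bits // 2))

    active = np.ones(n, dtype=bool)          # keys not yet pruned
    partial = np.zeros(n, dtype=np.int64)    # exact score from processed bit-planes

    for b in range(bits - 1, -1, -1):        # MSB -> LSB
        plane = (K >> b) & 1                 # current bit-plane of every key
        partial[active] += (plane[active] @ q) << b

        # Unprocessed bit-planes can add at most (2**b - 1) * sum(q).
        slack = (2 ** b - 1) * int(q.sum())
        lower, upper = partial, partial + slack

        # Guard filtering: drop keys whose upper bound trails the best
        # surviving lower bound by more than the margin.
        best_lower = lower[active].max()
        active &= upper >= best_lower - margin

    return partial, active                   # partial scores + surviving keys


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.integers(0, 256, size=64)
    K = rng.integers(0, 256, size=(512, 64))
    scores, survivors = bit_serial_guard_filter(q, K)
    # Survivors were never pruned, so their bit-serial scores are exact.
    assert np.array_equal(scores[survivors], (K @ q)[survivors])
    print(f"{survivors.sum()} / {len(survivors)} keys survive the guard filter")
```

In this sketch the pruned keys skip all remaining bit rounds, which is the source of the computational savings; how PADE sets the pruning threshold, handles signed operands, and balances the resulting bit-level workloads (BS-OOE) is described in the body of the paper.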