Multi-scale deformable attention (MSDeformAttn) has emerged as a key mechanism in various vision tasks, and its advantage is attributed to multi-scale grid sampling. However, this newly introduced operator incurs irregular data access and large memory requirements, leading to severe underutilization of processing elements (PEs). Meanwhile, existing approaches to attention acceleration cannot be applied directly to MSDeformAttn because they do not support this distinct procedure. We therefore propose DEFA, a dedicated algorithm-architecture co-design and the first method for MSDeformAttn acceleration. At the algorithm level, DEFA applies frequency-weighted pruning to feature maps and probability-aware pruning to sampling points, reducing the memory footprint by over 80%. At the architecture level, it exploits multi-scale parallelism to boost throughput significantly and further reduces memory access through fine-grained layer fusion and feature map reuse. Evaluated extensively on representative benchmarks, DEFA achieves a 10.1-31.9x speedup and a 20.3-37.7x energy efficiency improvement over powerful GPUs. It also improves energy efficiency by 2.2-3.7x over related accelerators while providing pioneering support for MSDeformAttn.
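To make the idea of pruning sampling points by their attention probability concrete, the following is a minimal sketch, not DEFA's actual algorithm. It assumes that each sampling point of one query receives a softmax-normalized attention probability and that points below a fixed threshold are skipped, so their feature values never need to be fetched. The threshold rule, the function names (`prune_sampling_points`, `aggregate`), and the toy numbers are all assumptions made for illustration; the abstract does not specify the pruning criterion.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prune_sampling_points(logits, threshold):
    """Hypothetical pruning rule: keep points whose probability >= threshold.

    Returns the kept indices and the full probability list.
    """
    probs = softmax(logits)
    kept = [i for i, p in enumerate(probs) if p >= threshold]
    return kept, probs


def aggregate(values, probs, kept):
    """Weighted sum over the kept points only; pruned points are not read."""
    return sum(probs[i] * values[i] for i in kept)


# Toy example: four sampling points for one query (assumed values).
logits = [4.0, 0.0, 3.0, -2.0]   # attention logits per sampling point
values = [1.0, 0.5, 2.0, -1.0]   # sampled feature values (scalars for brevity)

kept, probs = prune_sampling_points(logits, threshold=0.05)
dense = aggregate(values, probs, range(len(values)))  # no pruning
pruned = aggregate(values, probs, kept)               # with pruning

print("kept points:", kept)                  # kept == [0, 2]
print("absolute error:", abs(dense - pruned))
```

In this toy case half of the sampling points are skipped while the aggregated output changes only slightly, which illustrates why low-probability points are attractive pruning candidates. How well this holds on real models depends on the distribution of attention probabilities.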