Transformer-based architectures have demonstrated remarkable success across various domains, but their deployment on edge devices remains challenging due to high memory and computational demands. In this paper, we introduce a novel Reuse Attention mechanism, tailored for efficient memory access and computational optimization, enabling seamless operation on resource-constrained platforms without compromising performance. Unlike traditional multi-head attention (MHA), which redundantly computes a separate attention matrix for each head, Reuse Attention consolidates these computations into a single shared attention matrix, significantly reducing memory overhead and computational complexity. Comprehensive experiments on ImageNet-1K and downstream tasks show that the proposed UniForm models leveraging Reuse Attention achieve state-of-the-art ImageNet-1K classification accuracy while outperforming existing attention mechanisms, such as Linear Attention and Flash Attention, in inference speed and memory scalability. Notably, UniForm-l achieves 76.7% Top-1 accuracy on ImageNet-1K with a 21.8 ms inference time on edge devices such as the Jetson AGX Orin, representing up to a 5x speedup over competing benchmark methods. These results demonstrate the versatility of Reuse Attention across high-performance GPUs and edge platforms, paving the way for broader real-time applications.
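The core contrast described above, per-head attention matrices in standard MHA versus one shared attention matrix in Reuse Attention, can be illustrated with a minimal NumPy sketch. This is an assumption-based illustration of the general idea only, not the paper's actual implementation: the function names, the choice to build the shared matrix from the full-dimension queries and keys, and the way each head reuses it over its value slice are all hypothetical simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    # Standard MHA: every head computes its own N x N attention matrix,
    # so H separate matrices are materialized.
    n, d = Q.shape
    dh = d // num_heads
    outs = []
    for h in range(num_heads):
        s = slice(h * dh, (h + 1) * dh)
        A_h = softmax(Q[:, s] @ K[:, s].T / np.sqrt(dh))  # per-head matrix
        outs.append(A_h @ V[:, s])
    return np.concatenate(outs, axis=-1)

def reuse_attention_sketch(Q, K, V, num_heads):
    # Hypothetical sketch of the Reuse Attention idea: compute ONE shared
    # N x N attention matrix and reuse it for every head's value slice,
    # avoiding H redundant score computations.
    n, d = Q.shape
    dh = d // num_heads
    A = softmax(Q @ K.T / np.sqrt(d))  # single shared attention matrix
    outs = [A @ V[:, h * dh:(h + 1) * dh] for h in range(num_heads)]
    return np.concatenate(outs, axis=-1)
```

Both functions map `(N, d)` inputs to an `(N, d)` output, but the sketch computes and stores one attention matrix instead of `num_heads` of them, which is the source of the memory and compute savings the abstract attributes to Reuse Attention.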