Recently, linear-complexity sequence modeling networks have achieved modeling capabilities similar to Vision Transformers on a variety of computer vision tasks while using fewer FLOPs and less memory. However, their advantage in actual runtime speed is not significant. To address this issue, we introduce Gated Linear Attention (GLA) for vision, leveraging its superior hardware-awareness and efficiency. We propose direction-wise gating to capture 1D global context through bidirectional modeling, and a 2D gating locality injection to adaptively inject 2D local details into the 1D global context. Our hardware-aware implementation further merges forward and backward scanning into a single kernel, enhancing parallelism and reducing memory cost and latency. The proposed model, \name{}, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks, outperforming popular Transformer- and CNN-based models. Notably, \name{}-S matches DeiT-B's accuracy while using only 27\% of the parameters and 20\% of the FLOPs, running 2$\times$ faster on $224\times224$ images. At $1024\times1024$ resolution, \name{}-T uses 5.2$\times$ fewer FLOPs, saves 90\% GPU memory, runs 4.8$\times$ faster, and achieves 20.7\% higher top-1 accuracy than DeiT-T. These results position \name{} as an efficient and scalable solution for visual representation learning. Code is available at \url{https://github.com/hustvl/ViG}.