Recently, linear-complexity sequence modeling networks have achieved modeling capabilities similar to Vision Transformers on a variety of computer vision tasks while using fewer FLOPs and less memory. However, their advantage in actual runtime speed is not significant. To address this issue, we introduce Gated Linear Attention (GLA) for vision, leveraging its superior hardware-awareness and efficiency. We propose direction-wise gating to capture 1D global context through bidirectional modeling, and a 2D gating locality injection to adaptively inject 2D local details into the 1D global context. Our hardware-aware implementation further merges forward and backward scanning into a single kernel, enhancing parallelism and reducing memory cost and latency. The proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks, outperforming popular Transformer- and CNN-based models. Notably, ViG-S matches DeiT-B's accuracy while using only 27% of the parameters and 20% of the FLOPs, and runs 2$\times$ faster on $224\times224$ images. At $1024\times1024$ resolution, ViG-T uses 5.2$\times$ fewer FLOPs, saves 90% GPU memory, runs 4.8$\times$ faster, and achieves 20.7% higher top-1 accuracy than DeiT-T. These results position ViG as an efficient and scalable solution for visual representation learning. Code is available at \url{https://github.com/hustvl/ViG}.
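
To make the bidirectional gated recurrence concrete, the following is a minimal sketch in plain Python, not the ViG implementation. It assumes the standard gated linear attention recurrence $S_t = \mathrm{diag}(\alpha_t) S_{t-1} + k_t^\top v_t$ with output $o_t = q_t S_t$, and it assumes each direction has its own gate sequence. The function names (gla_scan, bidirectional_gla) and the choice to sum the two directions are illustrative assumptions. The sketch runs two separate passes, whereas the abstract describes merging both scans into a single kernel. The 2D locality injection is not modeled.

```python
def gla_scan(q, k, v, gates, reverse=False):
    """Gated linear attention written as a recurrence over a 1D token sequence.

    q, k, gates: lists of T vectors of length d_k; v: list of T vectors of length d_v.
    State update: S_t = diag(gates[t]) * S_prev + outer(k[t], v[t]).
    Output:       o_t = q[t] @ S_t.
    Cost is O(T * d_k * d_v), i.e. linear in the sequence length T.
    """
    T, d_k, d_v = len(q), len(k[0]), len(v[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running d_k x d_v state
    out = [None] * T
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        for i in range(d_k):
            for j in range(d_v):
                # Decay row i of the state by its gate, then add the new key-value outer product.
                S[i][j] = gates[t][i] * S[i][j] + k[t][i] * v[t][j]
        out[t] = [sum(q[t][i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return out


def bidirectional_gla(q, k, v, gates_fwd, gates_bwd):
    """Sum of a forward and a backward scan, each with its own gates (assumed combination).

    Simplification: the current token contributes to both scans, so it is counted twice.
    """
    fwd = gla_scan(q, k, v, gates_fwd)
    bwd = gla_scan(q, k, v, gates_bwd, reverse=True)
    return [[a + b for a, b in zip(f, r)] for f, r in zip(fwd, bwd)]


if __name__ == "__main__":
    ones = [[1.0]] * 3
    values = [[1.0], [2.0], [3.0]]
    # With all gates equal to 1 nothing is forgotten, so each scan is a running sum.
    print(bidirectional_gla(ones, ones, values, ones, ones))
```

With every gate fixed to 1 the state never decays and each token sees the whole sequence. With every gate fixed to 0 each token sees only itself. Learned gates between 0 and 1 interpolate between these extremes independently for each direction.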