CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction

Recently, considerable effort has been devoted to designing efficient linear-complexity visual Transformers. However, current linear attention models are generally unsuitable for deployment on resource-constrained mobile devices, because they either yield limited efficiency gains or suffer significant accuracy drops. In this paper, we propose a new de\textbf{C}oupled du\textbf{A}l-interactive linea\textbf{R} att\textbf{E}ntion (CARE) mechanism, showing that feature decoupling and interaction can fully unleash the power of linear attention. We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning of local inductive bias and long-range dependencies, thereby preserving sufficient local and global information while improving model efficiency. Then, a dynamic memory unit is employed to maintain critical information along the network pipeline. Moreover, we design a dual interaction module to facilitate interaction between local inductive bias and long-range information, as well as among features at different layers. By adopting a decoupled learning scheme and fully exploiting the complementarity across features, our method achieves both high efficiency and high accuracy. Extensive experiments on the ImageNet-1K, COCO, and ADE20K datasets demonstrate the effectiveness of our approach, e.g., achieving $78.4/82.1\%$ top-1 accuracy on ImageNet-1K at the cost of only $0.7/1.9$ GMACs. Code will be released at \href{https://github.com/zhouyuan888888/CARE-Transformer}{https://github.com/zhouyuan888888/CARE-Transformer}.

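To make the idea of decoupled linear attention more concrete, the following is a minimal NumPy sketch, not the released CARE implementation. It illustrates two points from the abstract: linear attention avoids the $N \times N$ similarity matrix by computing $\phi(K)^\top V$ first, and channels can be split unevenly so that one part is handled by a global (linear attention) branch and the other by a local branch. Everything specific here is an assumption made for illustration: the ELU+1 feature map, the moving average used as a stand-in for the local operator, the choice of which branch receives the smaller channel share, and the hypothetical names `feature_map`, `local_mixing`, `decoupled_block`, and `d_global`. Learned projections, the dynamic memory unit, and the dual interaction module are omitted.

```python
import numpy as np

def feature_map(x):
    # ELU(x) + 1: a positive kernel feature map often used for linear
    # attention (an assumption here, not taken from the abstract).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """O(N * d^2) attention: compute phi(K)^T V first, never the N x N matrix."""
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                  # (d, d_v) summary over all tokens
    z = q @ k.sum(axis=0)         # (N,) per-query normalizer
    return (q @ kv) / z[:, None]

def quadratic_attention(q, k, v):
    """Same result, but materializes the N x N similarity matrix: O(N^2 * d)."""
    q, k = feature_map(q), feature_map(k)
    sim = q @ k.T
    return (sim / sim.sum(axis=1, keepdims=True)) @ v

def local_mixing(x, kernel=3):
    """Stand-in for a local branch: moving average over neighboring tokens."""
    pad = kernel // 2
    padded = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([padded[i:i + kernel].mean(axis=0) for i in range(x.shape[0])])

def decoupled_block(x, d_global):
    """Split channels unevenly, process each part separately, then concatenate."""
    x_global, x_local = x[:, :d_global], x[:, d_global:]
    out_global = linear_attention(x_global, x_global, x_global)  # long-range branch
    out_local = local_mixing(x_local)                            # local branch
    return np.concatenate([out_global, out_local], axis=1)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))      # 16 tokens, 8 channels
out = decoupled_block(tokens, d_global=2)  # 2 channels global, 6 channels local
```

In this sketch the cost of the global branch scales with the square of its channel width, so routing fewer channels through it reduces the attention cost. That gives one intuition for why an uneven split might help efficiency, under the assumptions stated above.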