Agent Attention: On the Integration of Softmax and Linear Attention

The attention module is the key component in Transformers. While the global attention mechanism offers high expressiveness, its excessive computational cost restricts its applicability in various scenarios. In this paper, we propose a novel attention paradigm, Agent Attention, to strike a favorable balance between computational efficiency and representation power. Specifically, Agent Attention, denoted as a quadruple $(Q, A, K, V)$, introduces an additional set of agent tokens $A$ into the conventional attention module. The agent tokens first act as agents for the query tokens $Q$, aggregating information from $K$ and $V$, and then broadcast that information back to $Q$. Because the number of agent tokens can be designed to be much smaller than the number of query tokens, agent attention is significantly more efficient than the widely adopted Softmax attention while preserving global context modelling capability. Interestingly, we show that the proposed agent attention is equivalent to a generalized form of linear attention. Agent attention therefore seamlessly integrates the powerful Softmax attention and the highly efficient linear attention. Extensive experiments demonstrate the effectiveness of agent attention with various vision Transformers and across diverse vision tasks, including image classification, object detection, semantic segmentation and image generation. Notably, agent attention shows remarkable performance in high-resolution scenarios, owing to its linear attention nature. For instance, when applied to Stable Diffusion, agent attention accelerates generation and substantially enhances image generation quality without any additional training. Code is available at https://github.com/LeapLabTHU/Agent-Attention.
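
The two-step mechanism described above (agents aggregate from $K$ and $V$, then broadcast back to $Q$) can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the released implementation. Several details are assumptions that the abstract does not specify: the $1/\sqrt{d}$ scaling, the helper `mean_pool_agents` (a hypothetical way to obtain $A$ by averaging groups of queries), and the toy sizes `N`, `n`, `d`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def agent_attention(Q, A, K, V):
    """Single-head sketch of agent attention on the quadruple (Q, A, K, V).

    Q: (N, d) queries, A: (n, d) agent tokens with n << N,
    K: (N, d) keys,    V: (N, d_v) values.
    The 1/sqrt(d) scaling is an assumption borrowed from standard attention.
    """
    scale = 1.0 / np.sqrt(Q.shape[-1])
    # Step 1 (aggregation): agents act as queries over K and V.
    agent_values = softmax((A @ K.T) * scale) @ V      # (n, d_v)
    # Step 2 (broadcast): queries attend to the agents.
    return softmax((Q @ A.T) * scale) @ agent_values   # (N, d_v)

def mean_pool_agents(Q, n):
    # Hypothetical way to build agents: average contiguous groups of queries.
    N, d = Q.shape
    return Q.reshape(n, N // n, d).mean(axis=1)

rng = np.random.default_rng(0)
N, n, d = 64, 8, 16          # toy sizes chosen for illustration only
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))
A = mean_pool_agents(Q, n)

out = agent_attention(Q, A, K, V)
print(out.shape)
```

Neither step forms an $N \times N$ attention map: both score matrices are $n \times N$ or $N \times n$, so the cost in this sketch grows as $O(Nnd)$ rather than the $O(N^2 d)$ of full Softmax attention. By associativity, the same output equals $(\mathrm{softmax}(QA^\top)\,\mathrm{softmax}(AK^\top))\,V$, which is the sense in which the two-step computation resembles a linear attention form.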
