Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short of traditional Transformers on recall-intensive tasks and demand significant resources when trained from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). In essence, GSA comprises two layers of GLA linked via $\operatorname{softmax}$, using context-aware memory reading and adaptive forgetting to improve memory capacity while keeping the recurrent state compact. This design greatly improves both training and inference efficiency through GLA's hardware-efficient training algorithm and the reduced state size. Moreover, retaining the $\operatorname{softmax}$ operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
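The recurrent-inference view described above can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the paper's hardware-efficient kernels: it keeps two bounded slot memories (the "two-layer GLA" state), decays each slot with a per-slot forget gate `alpha`, and reads the memory through a $\operatorname{softmax}$ over slot scores. The slot count `m`, the gate parameterization, and all shapes are illustrative.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def gsa_recurrent_step(K_mem, V_mem, q, k, v, alpha):
    """One recurrent GSA-style step (illustrative sketch only).

    K_mem, V_mem: (m, d) slot key/value memories -- the compact recurrent state.
    q, k, v:      (d,)  query/key/value for the current token.
    alpha:        (m,)  per-slot forget gate in (0, 1), data-dependent in GSA.
    """
    # Adaptive forgetting: each slot decays by alpha and absorbs
    # (1 - alpha) of the incoming key/value.
    K_mem = alpha[:, None] * K_mem + (1 - alpha)[:, None] * k
    V_mem = alpha[:, None] * V_mem + (1 - alpha)[:, None] * v
    # Context-aware read: softmax over slot scores, then mix slot values.
    o = V_mem.T @ softmax(K_mem @ q)
    return K_mem, V_mem, o
```

Because the state is a fixed $(m, d)$ pair regardless of sequence length, per-token inference cost stays constant, which is the efficiency property the abstract highlights.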