An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates several architectural components commonly used in recurrent sequence models and find that it significantly improves the performance of both FoX and the Transformer. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
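To make the mechanism concrete, below is a minimal, naive PyTorch sketch of the data-dependent down-weighting described above. It materializes the full attention-score matrix rather than using the FlashAttention-compatible kernel the repository provides, and the function name `forgetting_attention` plus the assumption that per-head forget-gate logits are produced elsewhere by a learned projection of the input are ours for illustration only.

```python
import torch
import torch.nn.functional as F

def forgetting_attention(q, k, v, fgate_logits):
    """Naive sketch of Forgetting Attention (not the official kernel).

    q, k, v: (batch, heads, seq_len, head_dim)
    fgate_logits: (batch, heads, seq_len) pre-sigmoid forget-gate values,
        assumed to come from a learned per-head projection of the input
        (that projection is omitted here).
    """
    B, H, T, d_head = q.shape
    # Per-step forget gates in (0, 1), kept in log space for numerical stability.
    log_f = F.logsigmoid(fgate_logits)                          # (B, H, T)
    cum_log_f = torch.cumsum(log_f, dim=-1)                     # (B, H, T)
    # Decay bias for query i and key j: sum_{l=j+1}^{i} log f_l = cum[i] - cum[j].
    decay = cum_log_f[..., :, None] - cum_log_f[..., None, :]   # (B, H, T, T)
    # Unnormalized attention scores, down-weighted by the accumulated forget gates.
    scores = (q @ k.transpose(-1, -2)) / d_head**0.5 + decay
    # Standard causal mask: query i may only attend to keys j <= i.
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example usage with random tensors (batch=2, heads=4, seq_len=8, head_dim=16):
# out = forgetting_attention(torch.randn(2, 4, 8, 16), torch.randn(2, 4, 8, 16),
#                            torch.randn(2, 4, 8, 16), torch.randn(2, 4, 8))
```

Note that if every forget gate saturates at 1 (logits very large), the decay bias vanishes and the sketch reduces to standard causal softmax attention; the data-dependent decay otherwise acts as a learned relative bias, which is consistent with FoX not requiring explicit positional embeddings.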