Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers. By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers. Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
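As an illustration of the central claim, the sketch below compares causal linear self-attention with a linear recurrence that uses multiplicative gating on its input and output. It is a minimal NumPy sketch under stated assumptions, not the construction from the paper: the function names, the matrix-valued state `S`, and the weight names `W_q`, `W_k`, `W_v` are hypothetical choices made for readability. The state is updated with an identity recurrence, and the only nonlinearities are products between linear maps of the current input.

```python
import numpy as np


def linear_self_attention(x, W_q, W_k, W_v):
    """Causal linear self-attention: y_t = sum_{s<=t} v_s * (k_s . q_t)."""
    q, k, v = x @ W_q.T, x @ W_k.T, x @ W_v.T
    scores = np.tril(q @ k.T)  # (T, T) causal scores, no softmax
    return scores @ v


def gated_linear_rnn(x, W_q, W_k, W_v):
    """The same map written as a linear recurrence with multiplicative gating.

    Assumption for illustration: the hidden state is kept as a matrix S of
    shape (d_v, d_k); flattening it gives a vector state with an identity
    (hence diagonal) recurrence.
    """
    d_v, d_k = W_v.shape[0], W_k.shape[0]
    S = np.zeros((d_v, d_k))
    outputs = []
    for x_t in x:
        q_t, k_t, v_t = W_q @ x_t, W_k @ x_t, W_v @ x_t
        # Input gating: the update is a product of two linear maps of x_t.
        S = S + np.outer(v_t, k_t)
        # Output gating: the state is multiplied by a linear map of x_t.
        outputs.append(S @ q_t)
    return np.stack(outputs)


# Small usage example with random inputs (sizes are arbitrary).
rng = np.random.default_rng(0)
T, d, d_k, d_v = 6, 4, 3, 5
x = rng.normal(size=(T, d))
W_q = rng.normal(size=(d_k, d))
W_k = rng.normal(size=(d_k, d))
W_v = rng.normal(size=(d_v, d))

y_attention = linear_self_attention(x, W_q, W_k, W_v)
y_rnn = gated_linear_rnn(x, W_q, W_k, W_v)
print(np.allclose(y_attention, y_rnn))
```

The two functions should agree up to floating-point error, because the recurrent state accumulates the sum of outer products of values and keys, and multiplying that sum by the current query yields the attention output.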