Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers. By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers. Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
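The claim that a linear recurrence plus multiplicative interactions can reproduce linear self-attention rests on a simple identity, which the following minimal sketch illustrates. It is not the construction from this work (which realizes the products through gating between linear recurrent layers); it only assumes unnormalized causal linear attention, y_t = sum over s <= t of (k_s . q_t) v_s, with queries, keys, and values given directly rather than produced by learned projections. All function and variable names are hypothetical.

```python
import random


def linear_attention_direct(queries, keys, values):
    """Causal linear self-attention: y_t = sum_{s<=t} (k_s . q_t) v_s."""
    outputs = []
    for t, q in enumerate(queries):
        y = [0.0] * len(values[0])
        for s in range(t + 1):
            score = sum(k_j * q_j for k_j, q_j in zip(keys[s], q))
            y = [y_i + score * v_i for y_i, v_i in zip(y, values[s])]
        outputs.append(y)
    return outputs


def linear_attention_recurrent(queries, keys, values):
    """Same outputs from a linear recurrence on a matrix-valued state.

    State update: S_t = S_{t-1} + v_t k_t^T  (linear in the state)
    Readout:      y_t = S_t q_t
    """
    d_v, d_k = len(values[0]), len(keys[0])
    state = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Multiplicative interaction 1: the outer product v k^T is the
        # input to the linear recurrence.
        for i in range(d_v):
            for j in range(d_k):
                state[i][j] += v[i] * k[j]
        # Multiplicative interaction 2: the state is multiplied by the query.
        outputs.append(
            [sum(state[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
        )
    return outputs


# Illustrative sizes and random inputs (assumptions, not from the source).
rng = random.Random(0)
T, d_k, d_v = 6, 3, 2
queries = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(T)]
values = [[rng.gauss(0, 1) for _ in range(d_v)] for _ in range(T)]

direct = linear_attention_direct(queries, keys, values)
recurrent = linear_attention_recurrent(queries, keys, values)
max_gap = max(
    abs(a - b) for ya, yb in zip(direct, recurrent) for a, b in zip(ya, yb)
)
print("largest difference between the two computations:", max_gap)
```

The two functions agree up to floating-point rounding. The recurrent version never looks back at earlier tokens: everything it needs is carried in the state, and the only nonlinearities are the two products marked in the comments.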