Despite their dominance in modern deep learning, and especially in NLP, transformer architectures underperform on long-range tasks compared with recent layers designed specifically for this purpose. In this work, drawing inspiration from key attributes of long-range layers, such as state-space layers, linear RNN layers, and global convolution layers, we demonstrate that minimal modifications to the transformer architecture can significantly enhance performance on the Long Range Arena (LRA) benchmark, thus narrowing the gap with these specialized layers. We identify two key principles for long-range tasks: (i) an inductive bias towards smoothness, and (ii) locality. As we show, integrating these ideas into the attention mechanism improves results with negligible additional computation and without any additional trainable parameters. Our theory and experiments also shed light on the reasons for the inferior performance of transformers on long-range tasks and identify critical properties that are essential for capturing long-range dependencies.