We propose a synthetic reasoning task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how Transformer architectures learn this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain lengths at training and test time), as well as architectural variants such as weight-tied layers or added convolutional components. We study how the trained models eventually succeed at the task; in particular, we are able to interpret some of the attention heads and how information flows through the network. Notably, we identify a novel \emph{association} pattern that globally attends only to identical tokens. Based on these observations, we hypothesize that pretraining helps on LEGO tasks because of certain structured attention patterns, and we verify this hypothesis experimentally. We also observe that in some data regimes the trained Transformer finds ``shortcut'' solutions for following the chain of reasoning, which impede the model's robustness, and we propose ways to prevent this. Motivated by our findings on structured attention patterns, we propose the LEGO attention module, a drop-in replacement for vanilla attention heads. This architectural change significantly reduces FLOPs and maintains or even \emph{improves} the model's performance in large-scale pretraining.
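To make the \emph{association} pattern concrete, the following is a minimal sketch of an attention matrix in which each position attends only to positions holding the same token. It is an illustration under stated assumptions, not the paper's implementation: the function name `association_attention`, the uniform weighting over matches, and the toy token sequence are all hypothetical choices made here.

```python
def association_attention(tokens):
    """Build a hard "association" attention matrix for a token sequence.

    Row i is a probability distribution that puts equal weight on every
    position j whose token equals tokens[i], and zero weight elsewhere.
    Uniform weighting is an assumption made for this sketch.
    """
    n = len(tokens)
    weights = []
    for i in range(n):
        # 1.0 where the token at j is identical to the token at i.
        matches = [1.0 if tokens[j] == tokens[i] else 0.0 for j in range(n)]
        total = sum(matches)  # always >= 1: a token matches itself
        weights.append([m / total for m in matches])
    return weights


# Hypothetical toy sequence: "a" occurs at positions 0 and 6.
toy_tokens = ["a", "=", "1", ";", "b", "=", "a"]
toy_attention = association_attention(toy_tokens)

for token, row in zip(toy_tokens, toy_attention):
    print(token, row)
```

In this sketch the pattern is global: the second occurrence of `a` attends to the first occurrence regardless of how far apart the two are, while a token that appears only once attends to itself alone.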