Transformers demonstrate impressive performance on a range of reasoning benchmarks. To evaluate the degree to which these abilities result from actual reasoning, existing work has focused on developing sophisticated benchmarks for behavioral studies. However, these studies do not provide insights into the internal mechanisms driving the observed capabilities. To improve our understanding of these mechanisms, we present a comprehensive mechanistic analysis of a transformer trained on a synthetic reasoning task. We identify a set of interpretable mechanisms the model uses to solve the task, and validate our findings using correlational and causal evidence. Our results suggest that the model implements a depth-bounded recurrent mechanism that operates in parallel and stores intermediate results in selected token positions. We anticipate that the motifs we identified in our synthetic setting can provide valuable insights into the broader operating principles of transformers and thus provide a basis for understanding more complex models.