Most interpretability research in NLP focuses on understanding the behavior and features of a fully trained model. However, certain insights into model behavior may only be accessible by observing the trajectory of the training process. We present a case study of syntax acquisition in masked language models (MLMs) that demonstrates how analyzing the evolution of interpretable artifacts throughout training deepens our understanding of emergent behavior. In particular, we study Syntactic Attention Structure (SAS), a naturally emerging property of MLMs wherein specific Transformer heads tend to focus on specific syntactic relations. We identify a brief window in pretraining when models abruptly acquire SAS, concurrent with a steep drop in loss. This breakthrough precipitates the subsequent acquisition of linguistic capabilities. We then examine the causal role of SAS by manipulating SAS during training, and demonstrate that SAS is necessary for the development of grammatical capabilities. We further find that SAS competes with other beneficial traits during training, and that briefly suppressing SAS improves model quality. These findings offer an interpretation of a real-world example of both simplicity bias and breakthrough training dynamics.
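To make the notion of a head that "focuses on a specific syntactic relation" concrete, the following is a minimal sketch of one way such specialization could be scored for a single attention head. It is an illustrative assumption, not necessarily the metric used in this work: the function name `head_relation_score`, the toy attention matrix, and the toy parse are all hypothetical. The score is the fraction of non-root tokens whose most-attended token is their dependency parent.

```python
import numpy as np

def head_relation_score(attn, parents):
    """Fraction of non-root tokens whose most-attended token is their parent.

    attn:    (seq_len, seq_len) array; attn[i, j] is attention from token i to j.
    parents: length seq_len sequence; parents[i] is the index of token i's
             dependency parent, or -1 for the root token.
    Hypothetical helper for illustration only.
    """
    attn = np.asarray(attn, dtype=float)
    top = attn.argmax(axis=1)  # most-attended position for each token
    scored = [i for i, p in enumerate(parents) if p >= 0]  # skip the root
    if not scored:
        return 0.0
    hits = sum(1 for i in scored if top[i] == parents[i])
    return hits / len(scored)

# Toy example for "the dog barks": the -> dog, dog -> barks, barks is root.
attn = np.array([[0.1, 0.8, 0.1],
                 [0.2, 0.1, 0.7],
                 [0.3, 0.4, 0.3]])
parents = [1, 2, -1]
score = head_relation_score(attn, parents)
```

Under these assumptions, evaluating such a score at successive pretraining checkpoints would give a trajectory in which an abrupt rise could be compared against the loss curve.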