Most interpretability research in NLP focuses on understanding the behavior and features of a fully trained model. However, certain insights into model behavior may only be accessible by observing the trajectory of the training process. In this paper, we present a case study of syntax acquisition in masked language models (MLMs). Our findings demonstrate how analyzing the evolution of interpretable artifacts throughout training deepens our understanding of emergent behavior. In particular, we study Syntactic Attention Structure (SAS), a naturally emerging property of MLMs wherein specific Transformer heads tend to focus on specific syntactic relations. We identify a brief window in training when models abruptly acquire SAS and find that this window is concurrent with a steep drop in loss. Moreover, SAS precipitates the subsequent acquisition of linguistic capabilities. We then examine the causal role of SAS by introducing a regularizer to manipulate SAS during training, and demonstrate that SAS is necessary for the development of grammatical capabilities. We further find that SAS competes with other beneficial traits and capabilities during training, and that briefly suppressing SAS can improve model quality. These findings reveal a real-world example of the relationship between disadvantageous simplicity bias and interpretable breakthrough training dynamics.
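To make the notion of SAS concrete, the following is a minimal sketch, not the authors' implementation. It assumes attention weights for one sentence are available as an array of shape `[num_heads, seq_len, seq_len]` and that a dependency parse is given as `(dependent, head, relation)` index triples. The function names `head_relation_accuracy` and `syntactic_attention_penalty`, the either-direction matching rule, and the coefficient `lam` are illustrative assumptions. The first function asks, for each relation, which attention head best recovers its edges; the second computes a simple penalty of the kind a regularizer could add to the training loss to suppress or promote attention along syntactic edges.

```python
import numpy as np


def head_relation_accuracy(attn, edges):
    """For each dependency relation, find the head that best recovers it.

    attn:  array [num_heads, seq_len, seq_len]; row i holds token i's attention.
    edges: list of (dependent_idx, head_idx, relation) triples (assumed format).
    Returns {relation: (best_head_index, accuracy)}.
    """
    # Most-attended position for every token in every head: [num_heads, seq_len]
    top = attn.argmax(axis=-1)

    # Group the edges by relation label
    by_rel = {}
    for dep, gov, rel in edges:
        by_rel.setdefault(rel, []).append((dep, gov))

    result = {}
    for rel, pairs in by_rel.items():
        deps = np.array([d for d, _ in pairs])
        govs = np.array([g for _, g in pairs])
        # Assumption: an edge counts as recovered if either word
        # places its maximum attention on the other.
        hit = (top[:, deps] == govs) | (top[:, govs] == deps)
        acc = hit.mean(axis=1)  # per-head accuracy for this relation
        best = int(acc.argmax())
        result[rel] = (best, float(acc[best]))
    return result


def syntactic_attention_penalty(attn, edges, lam=1.0):
    """Sum of attention mass on syntactic edges, scaled by lam.

    Added to a loss, lam > 0 discourages attention along edges and
    lam < 0 encourages it (hypothetical regularizer form).
    """
    mask = np.zeros(attn.shape[1:], dtype=bool)
    for dep, gov, _ in edges:
        mask[dep, gov] = True
        mask[gov, dep] = True
    return lam * float(attn[:, mask].sum())
```

In this sketch, a model exhibits strong SAS when most relations have some head with high accuracy. Tracking that quantity across checkpoints, rather than only at the end of training, is the kind of trajectory analysis the abstract describes.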