Language models acquire extensive capabilities through pre-training, yet the pre-training process itself remains a black box. In this work, we track the evolution of linear, interpretable features across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point in training, while more complex patterns emerge in later stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on the two-stage learning process of Transformers, whose stages we term a statistical learning phase and a feature learning phase. Our work opens up the possibility of tracking fine-grained representational progress during language model learning dynamics.