Representational spaces learned via language modeling are fundamental to Natural Language Processing (NLP); however, there is limited understanding of how and when various types of linguistic information emerge and interact during training. Leveraging a novel information-theoretic probing suite, which enables direct comparisons not only of task performance but also of the tasks' representational subspaces, we analyze nine tasks covering syntax, semantics, and reasoning across 2M pre-training steps and five seeds. We identify critical learning phases across tasks and time, during which subspaces emerge, share information, and later disentangle to specialize. Across these phases, syntactic knowledge is acquired rapidly after 0.5% of full training. Continued performance improvements primarily stem from the acquisition of open-domain knowledge, while semantics and reasoning tasks benefit from later boosts to long-range contextualization and higher specialization. Measuring cross-task similarity further reveals that linguistically related tasks share information throughout training, and do so more during the critical phase of learning than before or after. Our findings have implications for model interpretability, multi-task learning, and learning from limited data.