We study how features emerge, disappear, and persist across models fine-tuned on different domains of text. Specifically, we start from a base one-layer Transformer language model trained on a mixture of the BabyLM corpus and a collection of Python code from The Stack. This base model is adapted to two new text domains, TinyStories and the Lua programming language, and the two resulting models are then merged using spherical linear interpolation (SLERP). Our exploration aims to provide deeper insight into the stability and transformation of features across typical transfer-learning scenarios, using small-scale models and sparse autoencoders.
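For concreteness, a minimal sketch of spherical linear interpolation between two flattened parameter vectors is given below. The tensor-by-tensor application, the helper name `slerp`, and the midpoint weight t = 0.5 in the usage comment are illustrative assumptions, not details fixed by this paper.

```python
import torch

def slerp(theta_a: torch.Tensor, theta_b: torch.Tensor, t: float) -> torch.Tensor:
    """Spherically interpolate between two flattened parameter vectors at weight t."""
    a = theta_a / theta_a.norm()
    b = theta_b / theta_b.norm()
    # Angle between the two normalized parameter vectors.
    omega = torch.arccos(torch.clamp(torch.dot(a, b), -1.0, 1.0))
    sin_omega = torch.sin(omega)
    if sin_omega.abs() < 1e-8:
        # Nearly parallel vectors: fall back to ordinary linear interpolation.
        return (1.0 - t) * theta_a + t * theta_b
    return (torch.sin((1.0 - t) * omega) / sin_omega) * theta_a \
         + (torch.sin(t * omega) / sin_omega) * theta_b

# Illustrative usage (assumed per-tensor merging of two state dicts at t = 0.5):
# merged = {k: slerp(sd_a[k].flatten(), sd_b[k].flatten(), 0.5).view_as(sd_a[k])
#           for k in sd_a}
```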