Do multilingual language models share abstract grammatical representations across languages, and if so, when do these develop? Following Sinclair et al. (2022), we use structural priming to test for abstract grammatical representations with causal effects on model outputs. We extend the approach to a Dutch-English bilingual setting, and we evaluate a Dutch-English language model during pre-training. We find that crosslingual structural priming effects emerge soon after exposure to the second language, with fewer than 1M tokens of data in that language. We discuss implications for data contamination, low-resource transfer, and how abstract grammatical representations emerge in multilingual models.