Multilingual large language models (MLLMs) are jointly trained on data from many different languages so that the representation of each language can benefit from other languages' data. Impressive performance on zero-shot cross-lingual transfer shows that these models are capable of exploiting data from other languages. Yet it remains unclear to what extent, and under which conditions, languages rely on each other's data. In this study, we use TracIn (Pruthi et al., 2020), a training data attribution (TDA) method, to retrieve the most influential training samples seen during multilingual fine-tuning for a particular test language. This allows us to analyse cross-lingual sharing mechanisms of MLLMs from a new perspective. While previous work studied cross-lingual sharing at the level of model parameters, we present the first approach to study cross-lingual sharing at the data level. We find that MLLMs rely on data from multiple languages from the early stages of fine-tuning and that this reliance gradually increases as fine-tuning progresses. We further study how different fine-tuning languages influence model performance on a given test language and find that they can both reinforce and complement the knowledge acquired from data of the test language itself.
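To make the attribution step concrete, the following is a minimal sketch of checkpoint-based TracIn scoring in the sense of Pruthi et al. (2020): the influence of a training sample on a test sample is approximated by summing, over saved checkpoints, the learning rate times the dot product of their loss gradients. It assumes a toy logistic-regression model in place of an MLLM. All function names, the `train_langs` language tags, and the `language_share` helper are hypothetical illustrations, not the setup used in the study.

```python
from collections import Counter

import numpy as np


def logistic_grad(w, x, y):
    """Gradient of the binary cross-entropy loss for one example (x, y) w.r.t. w."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (p - y) * x


def tracin_scores(checkpoints, lrs, train_X, train_y, test_x, test_y):
    """Checkpoint-based TracIn scores of every training sample for one test sample.

    score_j = sum over checkpoints of lr * <grad loss(train_j), grad loss(test)>.
    A positive score means the training sample tended to reduce the test loss.
    """
    scores = np.zeros(len(train_X))
    for w, lr in zip(checkpoints, lrs):
        g_test = logistic_grad(w, test_x, test_y)
        g_train = np.stack(
            [logistic_grad(w, x, y) for x, y in zip(train_X, train_y)]
        )
        scores += lr * (g_train @ g_test)
    return scores


def top_k_influential(scores, k):
    """Indices of the k training samples with the highest influence scores."""
    return np.argsort(-scores)[:k]


def language_share(top_idx, train_langs):
    """Fraction of the retrieved samples that come from each language (assumed tags)."""
    counts = Counter(train_langs[i] for i in top_idx)
    return {lang: n / len(top_idx) for lang, n in counts.items()}


# Toy usage: one checkpoint at w = 0, three training samples, one test sample.
checkpoints = [np.zeros(2)]
lrs = [0.1]
train_X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
train_y = np.array([1.0, 1.0, 1.0])
train_langs = ["en", "de", "fr"]  # hypothetical language tag per training sample
test_x, test_y = np.array([1.0, 0.0]), 1.0

scores = tracin_scores(checkpoints, lrs, train_X, train_y, test_x, test_y)
top = top_k_influential(scores, k=2)
shares = language_share(top, train_langs)
```

With per-sample language tags, the share of each fine-tuning language among the top-ranked samples gives one simple way to quantify how much a test language relies on data from other languages. For a real MLLM the gradients would come from automatic differentiation over (a subset of) the model parameters rather than from a closed-form expression.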