We study one type of imbalance often present in real-life multilingual classification datasets: an uneven distribution of labels across languages. We show evidence that fine-tuning a transformer-based Large Language Model (LLM) on a dataset with this imbalance leads to worse performance, a more pronounced separation of languages in the latent space, and the promotion of uninformative features. We modify the traditional class weighting approach to imbalance by calculating class weights separately for each language, and show that this helps mitigate those detrimental effects. These results raise awareness of the negative effects of language-specific class imbalance in multilingual fine-tuning and of the way the model learns to rely on the separation of languages to perform the task.
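The per-language class weighting described above can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so this example assumes the common inverse-frequency ("balanced") heuristic, weight = n_samples / (n_classes * class_count), applied within each language rather than over the whole dataset. The function names `per_language_class_weights` and `per_example_weights` and the toy data are hypothetical, introduced only for illustration.

```python
from collections import Counter, defaultdict


def per_language_class_weights(labels, languages):
    """Return {language: {label: weight}}.

    Assumed formula (not taken from the paper): inverse-frequency weights,
    n_lang / (num_classes * n_lang_label), computed separately per language.
    A label that never occurs in a language gets no entry for that language.
    """
    counts = defaultdict(Counter)
    for label, lang in zip(labels, languages):
        counts[lang][label] += 1

    num_classes = len(set(labels))
    weights = {}
    for lang, label_counts in counts.items():
        n_lang = sum(label_counts.values())
        weights[lang] = {
            label: n_lang / (num_classes * n)
            for label, n in label_counts.items()
        }
    return weights


def per_example_weights(labels, languages):
    """Look up the weight of each example from its (language, label) pair."""
    weights = per_language_class_weights(labels, languages)
    return [weights[lang][label] for label, lang in zip(labels, languages)]


# Toy dataset: globally balanced (4 "pos", 4 "neg"), but the label
# distribution is skewed in opposite directions in the two languages.
labels = ["pos", "pos", "pos", "neg", "pos", "neg", "neg", "neg"]
languages = ["en", "en", "en", "en", "de", "de", "de", "de"]

weights = per_language_class_weights(labels, languages)
example_weights = per_example_weights(labels, languages)
```

In this toy dataset, weights computed over the whole dataset would all equal 1.0 because the labels are balanced globally, so the language-specific skew would go uncorrected. Computing them per language up-weights the minority label within each language, and the resulting per-example weights could then be used to scale each example's loss during fine-tuning.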