Although multilingual language models exhibit impressive cross-lingual transfer capabilities on unseen languages, performance on downstream tasks suffers when a language's script differs from the scripts of the languages in the model's pre-training data. Transliteration offers a straightforward yet effective means to align the script of a resource-rich language with that of a target language, thereby enhancing cross-lingual transfer. However, for mixed languages, this approach is suboptimal, since only a subset of the language benefits from the cross-lingual transfer while the remainder is impeded. In this work, we focus on Maltese, a Semitic language with substantial influences from Arabic, Italian, and English that is notably written in Latin script. We present a novel dataset annotated with word-level etymology. We use this dataset to train a classifier that lets us make informed decisions about how to process each Maltese token. We contrast indiscriminate transliteration or translation with mixed processing pipelines that transliterate only words of Arabic origin, thereby producing text with a mixture of scripts. We fine-tune models on the processed data for four downstream tasks and show that conditional transliteration based on word etymology yields the best results, surpassing fine-tuning with raw Maltese or with Maltese processed by non-selective pipelines.
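The conditional pipeline described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `TOY_ETYMOLOGY` is a hypothetical lookup table standing in for the trained etymology classifier, `TOY_CHAR_MAP` is a naive character map that is not the transliteration scheme used in this work, and the etymology labels are illustrative only. The sketch shows just the control flow: tokens classified as Arabic-origin are transliterated, and all other tokens are left in Latin script, which yields mixed-script output.

```python
from typing import Callable, Dict, List

# Hypothetical toy lexicon standing in for the trained etymology classifier.
# Labels are illustrative only.
TOY_ETYMOLOGY: Dict[str, str] = {
    "kelb": "arabic",
    "dar": "arabic",
    "skola": "italian",
    "kompjuter": "english",
}

# Naive, illustrative Latin-to-Arabic character map. This is an assumption
# for the sketch, not a linguistically accurate transliteration scheme.
TOY_CHAR_MAP: Dict[str, str] = {
    "a": "ا", "b": "ب", "d": "د", "e": "ي", "k": "ك", "l": "ل", "r": "ر",
}


def classify_etymology(token: str) -> str:
    """Return a toy etymology label; unseen tokens are labeled 'unknown'."""
    return TOY_ETYMOLOGY.get(token.lower(), "unknown")


def transliterate(token: str) -> str:
    """Map each character through the toy table; unmapped characters pass through."""
    return "".join(TOY_CHAR_MAP.get(ch, ch) for ch in token.lower())


def conditional_transliterate(
    tokens: List[str],
    classify: Callable[[str], str] = classify_etymology,
) -> List[str]:
    """Transliterate only tokens classified as Arabic-origin; keep the rest as-is."""
    return [transliterate(t) if classify(t) == "arabic" else t for t in tokens]


if __name__ == "__main__":
    # Arabic-origin tokens come out in Arabic script; the others stay in Latin script.
    print(conditional_transliterate(["kelb", "skola", "dar", "kompjuter"]))
```

Passing the classifier as a parameter keeps the sketch agnostic about how etymology is predicted, so a trained model could replace the lookup table without changing the pipeline logic.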