Because large, representative corpora are scarce for most languages, Multilingual Language Models (MLLMs) must extract as much as possible from existing corpora. Script diversity makes this harder by reducing lexical overlap among closely related languages. Transliterating closely related languages that use different scripts into a common script may therefore improve the downstream task performance of MLLMs. In this paper, we pretrain two ALBERT models to empirically measure the effect of transliteration on MLLMs. We focus on the Indo-Aryan language family, which has the highest script diversity in the world, and evaluate our models on the IndicGLUE benchmark. We use the Mann-Whitney U test to verify whether the effect of transliteration is statistically significant. We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages. We also measure the cross-lingual representation similarity (CLRS) of the models using centered kernel alignment (CKA) on parallel sentences in eight languages from the FLORES-101 dataset. We find that the hidden representations of the transliteration-based model have higher and more stable CLRS scores. Our code is available on GitHub (github.com/ibraheem-moosa/XLM-Indic) and the Hugging Face Hub (huggingface.co/ibraheemmoosa/xlmindic-base-multiscript and huggingface.co/ibraheemmoosa/xlmindic-base-uniscript).
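As a rough illustration of the CKA measurement mentioned above, the following is a minimal sketch of linear CKA between two sets of sentence representations. It rests on several assumptions that the abstract does not confirm: a linear kernel is used, each row is one sentence's hidden representation, and the two matrices are row-aligned over parallel sentences. The function name `linear_cka` and the random arrays standing in for model outputs are hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two representation matrices.

    x has shape (n_sentences, dim_x) and y has shape (n_sentences, dim_y).
    Row i of x and row i of y are assumed to encode the same parallel sentence.
    """
    # Center every feature so the similarity ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # Linear-kernel CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F).
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


if __name__ == "__main__":
    # Placeholder data only: random vectors stand in for the hidden states
    # of one layer on parallel sentences in two languages.
    rng = np.random.default_rng(0)
    lang_a = rng.normal(size=(100, 16))
    lang_b = lang_a + 0.5 * rng.normal(size=(100, 16))  # noisy copy of lang_a
    print(f"CKA(lang_a, lang_b) = {linear_cka(lang_a, lang_b):.3f}")
```

The score lies in [0, 1] and is invariant to orthogonal transformations and isotropic scaling of either matrix, which is what makes it usable for comparing representations across languages or models.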