Script diversity poses a challenge for multilingual language models (MLLMs) because it reduces lexical overlap among closely related languages. Transliterating closely related languages that use different scripts into a common script may therefore improve the downstream task performance of MLLMs. We empirically measure the effect of transliteration on MLLMs in this setting. We focus on the Indic languages, which have the highest script diversity in the world, and evaluate our models on the IndicGLUE benchmark. We use the Mann-Whitney U test to check whether the effect of transliteration is statistically significant. We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages. We also measure the cross-lingual representation similarity of the models using centered kernel alignment (CKA) on parallel sentences from the FLORES-101 dataset. We find that the transliteration-based model learns sentence representations that are more similar across languages for parallel sentences.
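To make the representation-similarity measurement concrete, the following is a minimal sketch of linear centered kernel alignment between two sets of sentence representations. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which CKA variant or kernel is used, the function name `linear_cka` is hypothetical, and the random matrices merely stand in for model embeddings of parallel sentences.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two matrices of shape (n_sentences, dim).

    Row i of x and row i of y are assumed to be representations of the
    same parallel sentence in two different languages.
    """
    # Center each feature (column) so the score ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for sentence embeddings (assumption: 100 parallel
# sentences, 16-dimensional representations).
rng = np.random.default_rng(0)
reps_lang_a = rng.normal(size=(100, 16))

# A rotated copy of the same representations: linear CKA is invariant to
# orthogonal transformations, so the score should be (numerically) 1.
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
reps_lang_b = reps_lang_a @ rotation

# Independent random representations give a lower score.
reps_noise = rng.normal(size=(100, 16))

same = linear_cka(reps_lang_a, reps_lang_b)
unrelated = linear_cka(reps_lang_a, reps_noise)
print(f"rotated copy: {same:.4f}, unrelated: {unrelated:.4f}")
```

In this sketch a higher score means the two languages' representations of the same sentences share more structure, which is the sense in which one model's cross-lingual representations can be called "more similar" than another's.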