Pretrained language models (PLMs) have shown remarkable generalization across multiple tasks and languages. Nonetheless, their generalization to unseen languages is poor: performance degrades significantly, and the models may even generate nonsensical responses comparable to a random baseline. This limitation has been a longstanding problem of PLMs and raises concerns about diversity and equal access to language modeling technology. In this work, we address this limitation by introducing LinguAlchemy, a regularization technique that incorporates typological, geographical, and phylogenetic aspects of languages, constraining the resulting representations of PLMs to better characterize the corresponding linguistic constraints. LinguAlchemy significantly improves the accuracy of mBERT and XLM-R on unseen languages, by ~18% and ~2%, respectively, compared to fully finetuned models, and displays a high degree of unseen-language generalization. We further introduce AlchemyScale and AlchemyTune, extensions of LinguAlchemy that adjust the linguistic regularization weights automatically, alleviating the need for hyperparameter search. LinguAlchemy enables better cross-lingual generalization to unseen languages, which is vital for better inclusivity and accessibility of PLMs.
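The abstract does not spell out the form of the regularizer, so the following is only a minimal sketch of how a linguistic regularization term of this kind could be combined with a task loss. It assumes, purely for illustration, that the model's sentence representation is linearly projected into the space of a per-language feature vector (typological, geographical, and phylogenetic features concatenated) and penalized with a mean squared error. The names `regularized_loss`, `project`, `lang_vector`, and `reg_weight` are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Cross-entropy of a single example, using a stable log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def project(weights: Sequence[Sequence[float]], hidden: Sequence[float]) -> List[float]:
    """Linear map from the hidden vector to the linguistic-feature space.

    Assumption: a plain linear projection; the paper may use something else.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def regularized_loss(
    logits: Sequence[float],
    label: int,
    hidden: Sequence[float],
    weights: Sequence[Sequence[float]],
    lang_vector: Sequence[float],
    reg_weight: float = 1.0,
) -> float:
    """Task loss plus a weighted penalty pulling the projected representation
    toward the language's feature vector (hypothetical formulation)."""
    task = cross_entropy(logits, label)
    reg = mse(project(weights, hidden), lang_vector)
    return task + reg_weight * reg


# Toy usage: 2 classes, a 2-d hidden state, and a 3-d language feature vector.
loss = regularized_loss(
    logits=[0.0, 0.0],
    label=0,
    hidden=[1.0, 2.0],
    weights=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    lang_vector=[1.0, 2.0, 1.0],
    reg_weight=0.5,
)
```

In this sketch `reg_weight` is a fixed, hand-chosen value. The abstract states that AlchemyScale and AlchemyTune adjust the regularization weights automatically; how they do so is not described there, so it is not modeled here.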