Pretrained language models (PLMs) have become remarkably adept at task and language generalization. Nonetheless, they often fail when faced with unseen languages. In this work, we present LinguAlchemy, a regularization method that incorporates linguistic information covering typological, geographical, and phylogenetic features to align PLM representations with the corresponding linguistic information for each language. LinguAlchemy significantly improves the performance of mBERT and XLM-R on low-resource languages in multiple downstream tasks, such as intent classification, news classification, and semantic relatedness, compared to fully finetuned models, and displays a high degree of unseen-language generalization. We further introduce AlchemyScale and AlchemyTune, extensions of LinguAlchemy that adjust the linguistic regularization weights automatically, alleviating the need for hyperparameter search.
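As a rough illustration of what aligning a PLM representation with a linguistic vector could look like, the sketch below adds a weighted alignment penalty to a task loss. The abstract does not specify the form of the regularizer, so the linear projection, the mean-squared-error distance, and every name here (`project`, `mse`, `regularized_loss`, `lam`) are assumptions made for illustration, not the method's actual implementation.

```python
from typing import Sequence


def project(rep: Sequence[float], weights: Sequence[Sequence[float]]) -> list:
    """Linearly project a sentence representation into the linguistic-vector space.

    Each row of `weights` produces one output dimension (assumed design).
    """
    return [sum(w * x for w, x in zip(row, rep)) for row in weights]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def regularized_loss(
    task_loss: float,
    rep: Sequence[float],
    weights: Sequence[Sequence[float]],
    ling_vec: Sequence[float],
    lam: float = 1.0,
) -> float:
    """Task loss plus a weighted alignment penalty (hypothetical formulation).

    `ling_vec` stands in for a language's typological, geographical, and
    phylogenetic feature vector; `lam` is the regularization weight that
    would otherwise have to be found by hyperparameter search.
    """
    return task_loss + lam * mse(project(rep, weights), ling_vec)


if __name__ == "__main__":
    rep = [1.0, 2.0]                                 # toy pooled representation
    weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy projection matrix
    ling_vec = [1.0, 2.0, 0.0]                       # toy linguistic vector
    print(regularized_loss(0.5, rep, weights, ling_vec, lam=0.5))
```

In this framing, a fixed `lam` has to be tuned by hand; an automatic scheme would replace it with a weight that is adjusted during training.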