In recent years, the development of pre-trained language models (PLMs) has gained momentum, showcasing their capacity to transcend linguistic barriers and facilitate knowledge transfer across diverse languages. However, this progress has largely bypassed very low-resource languages, leaving a notable void in the multilingual landscape. This paper addresses the gap by introducing four PLMs tailored to Angolan languages and trained with a Multilingual Adaptive Fine-tuning (MAFT) approach. We also examine the role of informed embedding initialization and synthetic data in improving the downstream-task performance of MAFT models. Our models outperform the state-of-the-art baselines AfroXLMR-base (developed through MAFT) and OFA (an effective embedding initialization) by 12.3 and 3.8 points, respectively.