The Fon language, spoken by roughly 2 million people, is a truly low-resourced African language, with a limited online presence and few existing datasets, to name only two of its constraints. Multitask learning is a learning paradigm that aims to improve the generalization capacity of a model by sharing knowledge across different but related tasks, which could be especially beneficial in very data-scarce scenarios. In this paper, we present the first exploratory approach to multitask learning for enhancing model capabilities in Natural Language Processing for the Fon language. Specifically, we explore the tasks of Named Entity Recognition (NER) and Part-of-Speech (POS) Tagging for Fon. We leverage two language model heads as encoders to build shared representations for the inputs, and we use blocks of linear layers for the classification specific to each task. Our results on the NER and POS tasks for Fon show competitive (or better) performance compared to several multilingual pretrained language models finetuned on single tasks. Additionally, we perform a few ablation studies to assess the efficiency of two different loss combination strategies and find that the equal loss weighting approach works best in our case. Our code is open-sourced at https://github.com/bonaventuredossou/multitask_fon.
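The setup described above (two encoders feeding a shared representation, one linear classification block per task, and an equally weighted sum of the task losses) can be illustrated with a small, dependency-free sketch. It is not the released implementation. The concatenation of the two encoder outputs, the 0.5/0.5 weights, the label counts, the dimensions, and all function names are assumptions made for illustration, and random vectors stand in for the pretrained language model encoders.

```python
import math
import random


def init_linear(n_out, n_in, rng):
    """Create a small random weight matrix and a zero bias for one task head."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weight, bias


def linear(x, weight, bias):
    """Apply y = W x + b to a single token vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


def multitask_loss(enc_a, enc_b, ner_head, pos_head, ner_tags, pos_tags,
                   w_ner=0.5, w_pos=0.5):
    """Return (combined loss, mean NER loss, mean POS loss) for one sentence.

    enc_a and enc_b hold one vector per token from each encoder. Concatenating
    them into the shared representation is an assumption of this sketch.
    """
    ner_loss = 0.0
    pos_loss = 0.0
    for vec_a, vec_b, ner_tag, pos_tag in zip(enc_a, enc_b, ner_tags, pos_tags):
        shared = vec_a + vec_b  # list concatenation: shared token representation
        ner_loss += cross_entropy(linear(shared, *ner_head), ner_tag)
        pos_loss += cross_entropy(linear(shared, *pos_head), pos_tag)
    n_tokens = len(ner_tags)
    ner_loss /= n_tokens
    pos_loss /= n_tokens
    # Equal loss weighting: both tasks contribute with the same coefficient.
    return w_ner * ner_loss + w_pos * pos_loss, ner_loss, pos_loss


# Hypothetical sizes: encoder width 4, 9 NER labels, 17 POS labels, 3 tokens.
rng = random.Random(0)
DIM, NUM_NER, NUM_POS = 4, 9, 17
enc_a = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
enc_b = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
ner_head = init_linear(NUM_NER, 2 * DIM, rng)
pos_head = init_linear(NUM_POS, 2 * DIM, rng)

total, ner_loss, pos_loss = multitask_loss(
    enc_a, enc_b, ner_head, pos_head, ner_tags=[0, 3, 0], pos_tags=[5, 1, 12]
)
print(f"NER loss {ner_loss:.3f}, POS loss {pos_loss:.3f}, combined {total:.3f}")
```

In a real training loop the combined loss would be backpropagated through both heads and both encoders, so that gradients from each task update the shared representation. Passing other values for `w_ner` and `w_pos` is one way to try a different weighting scheme.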