The Fon language, spoken by roughly 2 million people, is a truly low-resourced African language with, among other limitations, a limited online presence and few existing datasets. Multitask learning is a learning paradigm that aims to improve the generalization capacity of a model by sharing knowledge across different but related tasks, which could be especially beneficial in very data-scarce scenarios. In this paper, we present the first exploratory approach to multitask learning for enhancing model capabilities in Natural Language Processing for the Fon language. Specifically, we explore the tasks of Named Entity Recognition (NER) and Part-of-Speech Tagging (POS) for Fon. We leverage two language model heads as encoders to build shared representations for the inputs, and we use blocks of linear layers for the classification specific to each task. Our results on the NER and POS tasks for Fon show competitive (or better) performance compared to several multilingual pretrained language models fine-tuned on single tasks. Additionally, we perform a few ablation studies to assess the effectiveness of two different loss combination strategies and find that the equal loss weighting approach works best in our case. Our code is open-sourced at https://github.com/bonaventuredossou/multitask_fon.
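To make the equal loss weighting idea concrete, the following is a minimal sketch and not the implementation from the repository above. It assumes that each task is trained with a token-level cross-entropy loss and that "equal weighting" means an unweighted sum of the NER and POS losses; the function names, label counts, and toy scores are hypothetical and chosen only for illustration.

```python
import math


def cross_entropy(logits, gold):
    """Mean token-level cross-entropy for one task.

    logits: list of per-token score lists (one score per label).
    gold:   list of gold label indices, one per token.
    """
    total = 0.0
    for scores, label in zip(logits, gold):
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[label]
    return total / len(gold)


def combine_losses(ner_loss, pos_loss, w_ner=1.0, w_pos=1.0):
    """Weighted sum of the two task losses.

    The default weights (1.0, 1.0) correspond to equal weighting
    (an assumption about how the strategy is defined).
    """
    return w_ner * ner_loss + w_pos * pos_loss


# Toy batch: 2 tokens, 3 hypothetical NER labels, 4 hypothetical POS tags.
ner_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
pos_logits = [[0.1, 0.2, 3.0, 0.4], [1.0, 0.0, 0.0, 0.0]]
ner_gold = [0, 1]
pos_gold = [2, 0]

ner_loss = cross_entropy(ner_logits, ner_gold)
pos_loss = cross_entropy(pos_logits, pos_gold)
total_loss = combine_losses(ner_loss, pos_loss)

print(f"NER loss: {ner_loss:.4f}, POS loss: {pos_loss:.4f}, total: {total_loss:.4f}")
```

In a full multitask setup, the logits for both tasks would come from task-specific classification blocks applied to the same shared representation, so a single backward pass on the combined loss updates the shared encoders with signal from both tasks.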