Natural Language Processing (NLP) research has advanced considerably in recent years, with major breakthroughs establishing new benchmarks. However, these advances have mainly benefited a group of languages commonly referred to as resource-rich, such as English and French. The majority of other languages, which have fewer resources, are left behind; this is the case for most African languages, including Wolof. In this work, we present a parallel Wolof/French corpus of 123,000 sentences, on which we conducted machine translation experiments with models based on Recurrent Neural Networks (RNNs) in different data configurations. We observed performance gains with models trained on subword-segmented data, as well as with models trained on the French-English language pair compared to those trained on the French-Wolof pair under the same experimental conditions.