Beqi: Revitalize the Senegalese Wolof Language with a Robust Spelling Corrector

The progress of Natural Language Processing (NLP), although fast in recent years, is not at the same pace for all languages. African languages in particular are still behind and lack automatic processing tools. Some of these tools are very important for the development of these languages but also have an important role in many NLP applications. This is particularly the case for automatic spell checkers. Several approaches have been studied to address this task and the one modeling spelling correction as a translation task from misspelled (noisy) text to well-spelled (correct) text shows promising results. However, this approach requires a parallel corpus of noisy data on the one hand and correct data on the other hand, whereas Wolof is a low-resource language and does not have such a corpus. In this paper, we present a way to address the constraint related to the lack of data by generating synthetic data and we present sequence-to-sequence models using Deep Learning for spelling correction in Wolof. We evaluated these models in three different scenarios depending on the subwording method applied to the data and showed that the latter had a significant impact on the performance of the models, which opens the way for future research in Wolof spelling correction.
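The abstract does not say how the synthetic data are produced. One common way to build a noisy/correct parallel corpus from clean text alone is to inject random character-level errors into correct sentences. The sketch below illustrates that idea under stated assumptions: the four edit operations, the `error_rate` parameter, the `ALPHABET` string, and the function names `add_noise` and `make_parallel_corpus` are all hypothetical and are not taken from the paper.

```python
import random

# Hypothetical character inventory used for insertions and substitutions.
# It is an illustrative assumption, not the alphabet used in the paper.
ALPHABET = "abcdefghijklmnopqrstuvwxyzñŋëéóà"


def add_noise(sentence, error_rate=0.1, rng=None):
    """Return a copy of `sentence` with random character-level errors.

    Each alphabetic character is corrupted with probability `error_rate`
    by one of four operations: delete, insert, substitute, or swap with
    the following character.
    """
    rng = rng or random.Random()
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < error_rate:
            op = rng.choice(["delete", "insert", "substitute", "swap"])
            if op == "delete":
                pass  # drop the character
            elif op == "insert":
                out.append(c)
                out.append(rng.choice(ALPHABET))  # add a spurious character
            elif op == "substitute":
                out.append(rng.choice(ALPHABET))  # replace the character
            elif op == "swap" and i + 1 < len(chars):
                out.append(chars[i + 1])  # transpose with the next character
                out.append(c)
                i += 1  # the next character has already been emitted
            else:
                out.append(c)  # swap requested on the last character: keep it
        else:
            out.append(c)
        i += 1
    return "".join(out)


def make_parallel_corpus(clean_sentences, error_rate=0.1, seed=0):
    """Build (noisy, correct) pairs for training a seq2seq corrector."""
    rng = random.Random(seed)
    return [(add_noise(s, error_rate, rng), s) for s in clean_sentences]


if __name__ == "__main__":
    # Illustrative sentences only; any clean Wolof text could be used.
    for noisy, clean in make_parallel_corpus(["Nanga def", "Jërëjëf"], 0.3):
        print(repr(noisy), "->", repr(clean))
```

Each pair places the corrupted sentence on the source side and the original on the target side, which matches the translation framing described in the abstract. Seeding the generator keeps the corpus reproducible across runs.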

