Yorùbá, an African language with roughly 47 million speakers, encompasses a continuum of several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties that have few or no resources or tools. We take steps towards bridging this gap by introducing YORÙLECT, a new high-quality parallel text and speech corpus spanning three domains and four regional Yorùbá dialects. To develop this corpus, we engaged native speakers and travelled to the communities where these dialects are spoken to collect text and speech data. Using the newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yorùbá and the other dialects across all tasks. However, we also show that dialect-adaptive finetuning narrows this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yorùbá and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We release the YORÙLECT dataset and models publicly under an open license.