The rapid evolution of Natural Language Processing (NLP) has favored major languages such as English, leaving a significant gap for many others due to limited resources. This is especially evident in data annotation, a task whose importance cannot be overstated, but which is time-consuming and costly. Thus, any dataset for a resource-poor language is precious, particularly when it is task-specific. Here, we explore the feasibility of repurposing an existing dataset for a new NLP task: we repurpose the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), to enable extractive QA (EQA) in the style of machine reading comprehension. We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). We also present QA evaluation results for several monolingual and cross-lingual QA pairs involving English, MSA, and five Arabic dialects. Our aim is to enable others to adapt our approach to the 120+ other language variants in Belebele, many of which are deemed under-resourced. We also conduct a thorough analysis and share insights from the process, which we hope will contribute to a deeper understanding of the challenges and opportunities associated with task reformulation in NLP research.
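The MCQA-to-EQA reformulation described in the abstract can be illustrated with a minimal sketch. It assumes that a human annotator has already selected an answer span from the passage, and it only packages that span into a SQuAD-style record with a character offset. The function name `to_extractive_record`, the row keys (`passage`, `question`, `options`, `correct_option`), and the output layout are hypothetical choices for illustration; the abstract does not specify the authors' data format or tooling.

```python
from typing import Optional


def to_extractive_record(mcqa_row: dict, answer_span: str) -> Optional[dict]:
    """Build a SQuAD-style record from an MCQA row plus a human-chosen span.

    Returns None when the span is empty or does not occur verbatim in the
    passage, which signals that the item needs re-annotation or exclusion.
    """
    passage = mcqa_row["passage"]
    start = passage.find(answer_span)  # character offset of first occurrence
    if not answer_span or start == -1:
        return None
    return {
        "context": passage,
        "question": mcqa_row["question"],
        "answers": {"text": [answer_span], "answer_start": [start]},
    }


# Toy row for illustration only; it is not taken from Belebele.
row = {
    "passage": "The museum opened in 1901 and moved to its current site in 1950.",
    "question": "When did the museum open?",
    "options": ["1901", "1950", "1910", "1905"],
    "correct_option": 0,
}
record = to_extractive_record(row, "1901")
```

Items whose correct option cannot be located as a verbatim span come back as `None`, which is one simple way to flag the cases where an MCQA answer is a paraphrase and an annotator has to decide how to proceed.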