The widespread availability of Question Answering (QA) datasets in English has greatly facilitated progress in Natural Language Processing (NLP). However, the scarcity of such resources for minority languages, such as Basque, poses a substantial challenge for these communities. In this context, the translation and alignment of existing QA datasets play a crucial role in narrowing this technological gap. This work presents EuSQuAD, the first initiative dedicated to automatically translating and aligning SQuAD2.0 into Basque, resulting in more than 142k QA examples. We demonstrate EuSQuAD's value through extensive qualitative analysis and through QA experiments that use EuSQuAD as training data. The experiments are evaluated on a new human-annotated dataset.