The need for Question Answering (QA) datasets in low-resource languages motivated this research, which led to the development of the Kencorpus Swahili Question Answering Dataset (KenSwQuAD). The dataset is annotated from raw story texts in Swahili, a low-resource language spoken predominantly in Eastern Africa and also in other parts of the world. QA datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems, and machine learning systems need training data such as the gold-standard QA set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 of the 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance check on 12.5% of the annotated texts confirmed that the QA pairs were all correctly annotated. A proof of concept applying the dataset to the QA task confirmed that it is usable for such tasks. KenSwQuAD has also contributed to the resourcing of the Swahili language.