The absence of accessible annotated datasets explicitly tailored for educational purposes presents a notable obstacle for NLP tasks in low-resource languages. This study first explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert-annotated educational dataset consisting of 2.68K question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are drawn from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that TIGQA requires skills beyond simple word matching, demanding both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the-art machine reading comprehension (MRC) methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and compare it with the results obtained from pretrained models. The notable gap between human performance and the best model performance underscores the potential for further improvement on TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in Tigrinya MRC.