TIGQA: An Expert Annotated Question Answering Dataset in Tigrinya

The absence of explicitly tailored, accessible annotated datasets for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources. This study initially explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert-annotated educational dataset consisting of 2.68K question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are drawn from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset requires skills beyond simple word matching, requiring both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the-art machine reading comprehension (MRC) methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and compare it with the results obtained from pretrained models. The notable gap between human performance and the best model performance underscores the potential for further improvement on TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in Tigrinya MRC.
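For readers unfamiliar with the SQuAD format mentioned in the abstract, the following is a minimal sketch of what one record looks like and how a predicted answer is typically scored. The record text is an English placeholder and not TIGQA data, and the `id` value is hypothetical. The abstract does not say which metrics were used; exact match and token-level F1 are assumed here because they are the usual SQuAD-style measures, and the scoring below is a simplified version (lowercasing and whitespace tokenization only).

```python
from collections import Counter

# Hypothetical SQuAD-style record (placeholder English text, not from TIGQA).
# "answer_start" is the character offset of the answer span in the context.
record = {
    "context": "Water boils at 100 degrees Celsius at sea level.",
    "qas": [
        {
            "id": "example-0",  # hypothetical identifier
            "question": "At what temperature does water boil at sea level?",
            "answers": [{"text": "100 degrees Celsius", "answer_start": 15}],
        }
    ],
}


def answer_span_is_consistent(rec):
    """Check that every answer text matches the context at its offset."""
    context = rec["context"]
    for qa in rec["qas"]:
        for ans in qa["answers"]:
            start = ans["answer_start"]
            if context[start:start + len(ans["text"])] != ans["text"]:
                return False
    return True


def exact_match(prediction, reference):
    """1.0 if the strings match after lowercasing and trimming, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """Token-overlap F1 using whitespace tokenization (a simplification)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = record["qas"][0]["answers"][0]["text"]
    print(answer_span_is_consistent(record))
    print(exact_match("100 degrees", gold), token_f1("100 degrees", gold))
```

A partially correct prediction such as "100 degrees" gets no credit under exact match but partial credit under F1, which is why the two scores are usually reported together when comparing models against human performance.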

