Large Language Models (LLMs) have demonstrated impressive zero-shot performance on a wide range of NLP tasks, showing an ability to reason and apply commonsense knowledge. A relevant application is using them to create high-quality synthetic datasets for downstream tasks. In this work, we probe whether GPT-4 can be used to augment existing extractive reading comprehension datasets. Automating data annotation has the potential to save the large amounts of time, money, and effort that go into manually labelling datasets. We evaluate GPT-4 as a replacement for human annotators on low-resource reading comprehension tasks by comparing performance after fine-tuning and the cost associated with annotation. This work serves as the first analysis of LLMs as synthetic data augmenters for QA systems, highlighting the unique opportunities and challenges. Additionally, we release augmented versions of low-resource datasets that will allow the research community to create further benchmarks for evaluating generated datasets.