Large Language Models (LLMs) have shown impressive zero-shot performance on a wide range of NLP tasks, demonstrating the ability to reason and apply commonsense knowledge. One relevant application is using them to create high-quality synthetic datasets for downstream tasks. In this work, we probe whether GPT-4 can be used to augment existing extractive reading comprehension datasets. Automating the data annotation process has the potential to save the large amounts of time, money, and effort that go into manually labelling datasets. In this paper, we evaluate GPT-4 as a replacement for human annotators on low-resource reading comprehension tasks, comparing both downstream performance after fine-tuning and the cost of annotation. This work serves as the first analysis of LLMs as synthetic data augmenters for QA systems, highlighting the unique opportunities and challenges involved. Additionally, we release augmented versions of low-resource datasets that will allow the research community to create further benchmarks for evaluating generated datasets.