In this paper, we investigate prompting a large generative language model, ChatGPT, to generate synthetic training data for augmenting datasets in low-resource scenarios. We show that, with appropriate task-specific ChatGPT prompts, this approach outperforms the most popular existing approaches to such data augmentation. Furthermore, we investigate methodologies for evaluating the similarity of the augmented data generated by ChatGPT, with the aim of validating and assessing the quality of that data.