Is a prompt and a few samples all you need? Using GPT-4 for data augmentation in low-resource classification tasks

Obtaining and annotating data can be expensive and time-consuming, especially in complex, low-resource domains. We use GPT-4 and ChatGPT to augment small labeled datasets with synthetic data via simple prompts, across three classification tasks of varying complexity. For each task, we randomly select a base sample of 500 texts and use it to generate 5,000 new synthetic samples. We explore two augmentation strategies: one that preserves the original label distribution and another that balances it. Using progressively larger training sample sizes, we train and evaluate a 110M-parameter multilingual language model on the real and synthetic data separately. We also test GPT-4 and ChatGPT in a zero-shot setting on the test sets. We observe that GPT-4 and ChatGPT have strong zero-shot performance across all tasks. We find that data augmented with synthetic samples yields good downstream performance and particularly aids in low-resource settings, such as identifying rare classes. Human-annotated data exhibits strong predictive power, overtaking synthetic data in two of the three tasks. This finding highlights the need for more complex prompts if synthetic datasets are to consistently surpass human-generated ones.
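The two augmentation strategies can be illustrated by how a generation budget might be split across labels. The sketch below is not the authors' code: the function name `allocate_synthetic_counts`, the strategy names, and the reading of "balanced" as an equal number of synthetic samples per class are all assumptions made for illustration. It only computes how many synthetic samples to request per label; the prompting step itself is left out.

```python
from collections import Counter


def allocate_synthetic_counts(labels, total, strategy="proportional"):
    """Split a synthetic-sample budget across labels.

    Hypothetical helper, not taken from the paper.
    labels:   labels of the base sample (e.g. 500 items)
    total:    number of synthetic samples to generate (e.g. 5000)
    strategy: "proportional" keeps the base label distribution,
              "balanced" gives every class an equal share (assumed reading)
    """
    if not labels:
        raise ValueError("labels must not be empty")

    counts = Counter(labels)
    classes = sorted(counts)

    if strategy == "proportional":
        shares = {c: counts[c] for c in classes}
    elif strategy == "balanced":
        shares = {c: 1 for c in classes}
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Integer division first, then hand out the leftover samples to the
    # classes with the largest remainders so the total is met exactly.
    denom = sum(shares.values())
    alloc, remainders = {}, {}
    for c in classes:
        alloc[c], remainders[c] = divmod(shares[c] * total, denom)

    leftover = total - sum(alloc.values())
    by_remainder = sorted(classes, key=lambda c: remainders[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc


if __name__ == "__main__":
    # Toy base sample with one rare class (made-up labels and counts).
    base_labels = ["common"] * 450 + ["rare"] * 50
    print(allocate_synthetic_counts(base_labels, 5000, "proportional"))
    print(allocate_synthetic_counts(base_labels, 5000, "balanced"))
```

With a skewed base sample, the proportional split reproduces the skew in the synthetic set, while the balanced split gives the rare class far more synthetic examples than its share of the base sample.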

