Obtaining and annotating data can be expensive and time-consuming, especially in complex, low-resource domains. We use GPT-4 and ChatGPT to augment small labeled datasets with synthetic data via simple prompts, in three classification tasks of varying complexity. For each task, we randomly select a base sample of 500 texts and use it to generate 5,000 new synthetic samples. We explore two augmentation strategies: one that preserves the original label distribution and another that balances it. Using progressively larger training sample sizes, we train and evaluate a 110M-parameter multilingual language model on the real and synthetic data separately. We also test GPT-4 and ChatGPT in a zero-shot setting on the test sets. We observe that GPT-4 and ChatGPT have strong zero-shot performance across all tasks. We find that data augmented with synthetic samples yields good downstream performance and particularly aids in low-resource settings, such as identifying rare classes. Human-annotated data exhibits strong predictive power, overtaking synthetic data in two out of the three tasks. This finding highlights the need for more complex prompts if synthetic datasets are to consistently surpass human-generated ones.
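The two augmentation strategies differ only in how the 5,000 synthetic samples are split across labels. The following is a minimal sketch of that split, not the authors' implementation: the function name `synthetic_allocation`, the strategy names, and the largest-remainder rounding are all assumptions made for illustration, and the example labels are invented.

```python
from collections import Counter


def synthetic_allocation(base_labels, total, strategy):
    """Return how many synthetic samples to generate per label.

    Hypothetical helper: the abstract does not specify how counts are derived.
    strategy="proportional" preserves the base label distribution;
    strategy="balanced" gives every label an equal share.
    """
    counts = Counter(base_labels)
    classes = sorted(counts)
    if strategy == "proportional":
        # Each label's share is count / len(base_labels).
        numerators = {c: counts[c] for c in classes}
        denominator = len(base_labels)
    elif strategy == "balanced":
        # Each label's share is 1 / number_of_labels.
        numerators = {c: 1 for c in classes}
        denominator = len(classes)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    # Integer arithmetic avoids floating-point truncation surprises.
    alloc = {c: numerators[c] * total // denominator for c in classes}
    remainders = {c: numerators[c] * total % denominator for c in classes}

    # Largest-remainder rounding (an assumption) so the counts sum to `total`.
    leftover = total - sum(alloc.values())
    by_remainder = sorted(classes, key=lambda c: remainders[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc


if __name__ == "__main__":
    # Invented base sample of 500 labels with one rare class.
    base = ["rare"] * 50 + ["common"] * 450
    print(synthetic_allocation(base, 5000, "proportional"))
    print(synthetic_allocation(base, 5000, "balanced"))
```

Under the balanced strategy the rare class receives as many synthetic samples as the common one, which is one plausible reading of why augmentation is reported to help with rare classes.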