The rapid advancement of large language models (LLMs) has sparked interest in data synthesis techniques that aim to generate diverse, high-quality synthetic datasets. However, these synthetic datasets often suffer from a lack of diversity and from added noise. In this paper, we present TarGEN, a multi-step prompting strategy for generating high-quality synthetic datasets using an LLM. An advantage of TarGEN is its seedless nature: it does not require specific task instances, which broadens its applicability beyond task replication. We augment TarGEN with a method known as self-correction, which enables the LLM to rectify inaccurately labeled instances during dataset creation, ensuring reliable labels. To assess our technique's effectiveness, we emulate 8 tasks from the SuperGLUE benchmark and finetune various language models, including encoder-only, encoder-decoder, and decoder-only models, on both the synthetic and the original training sets. Evaluation on the original test set reveals that models trained on datasets generated by TarGEN perform approximately 1-2 percentage points better than those trained on the original datasets (82.84% on synthetic vs. 81.12% on original data, using Flan-T5). With instruction tuning, Flan-T5 reaches 84.54% on synthetic data vs. 81.49% on original data. A comprehensive comparison of the synthetic and original datasets reveals that the synthetic dataset demonstrates similar or higher levels of dataset complexity and diversity. Furthermore, the synthetic dataset displays a bias level that aligns closely with that of the original dataset. Finally, when pre-finetuned on our synthetic SuperGLUE dataset, T5-3B performs strongly on the OpenLLM leaderboard, surpassing the model trained on the Self-Instruct dataset by 4.14 percentage points. We hope that TarGEN can help with high-quality data generation and reduce the human effort needed to create complex benchmarks.