The rapid advancement of large language models (LLMs) has sparked interest in data synthesis techniques that aim to generate diverse, high-quality synthetic datasets. However, such synthetic datasets often suffer from limited diversity and added noise. In this paper, we present TarGEN, a multi-step prompting strategy for generating high-quality synthetic datasets using an LLM. An advantage of TarGEN is its seedless nature: it does not require specific task instances, broadening its applicability beyond task replication. We augment TarGEN with a method we call self-correction, which empowers LLMs to rectify inaccurately labeled instances during dataset creation, ensuring reliable labels. To assess our technique's effectiveness, we emulate 8 tasks from the SuperGLUE benchmark and finetune various language models, spanning encoder-only, encoder-decoder, and decoder-only architectures, on both the synthetic and original training sets. Evaluation on the original test sets reveals that models trained on TarGEN-generated datasets outperform those trained on the original datasets by approximately 1-2 percentage points (82.84% on synthetic vs. 81.12% on original data with Flan-T5). With instruction tuning, Flan-T5 reaches 84.54% on synthetic data vs. 81.49% on original data. A comprehensive analysis shows that the synthetic datasets exhibit similar or higher levels of complexity and diversity than the originals, and a bias level that closely matches the original datasets. Finally, when pre-finetuned on our synthetic SuperGLUE data, T5-3B yields impressive results on the OpenLLM leaderboard, surpassing the model trained on the Self-Instruct dataset by 4.14 percentage points. We hope that TarGEN proves helpful for high-quality data generation and for reducing the human effort needed to create complex benchmarks.