Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Nevertheless, existing generative frameworks still face challenges in generalization, controllability, diversity, and truthfulness. To address these challenges, this paper presents UniGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. UniGen is adaptable, supporting all types of text datasets, and enhances the generative process through innovative mechanisms. To increase data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature. To improve accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also accepts user-specified constraints, so the data generation process can be customized to particular requirements. Extensive experiments demonstrate the superior quality of the data generated by UniGen and show that each module plays a critical role in that quality. Additionally, UniGen is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that UniGen effectively supports dynamic and evolving benchmarking, and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.