Synthetic data generation with Large Language Models is a promising paradigm for augmenting natural data over a nearly infinite range of tasks. Given this variety, direct comparisons among synthetic data generation algorithms are scarce, making it difficult to understand where improvement comes from and what bottlenecks exist. We propose to evaluate algorithms via the composition of the synthetic data each produces, measured in terms of data quality, diversity, and complexity. We choose these three characteristics for their significance in open-ended processes and the impact each has on the capabilities of downstream models. We find quality to be essential for in-distribution model generalization, diversity to be essential for out-of-distribution generalization, and complexity to be beneficial for both. Further, we emphasize the existence of quality-diversity trade-offs in training data and their downstream effects on model performance. We then examine the effect of various components in the synthetic data pipeline on each data characteristic. This examination allows us to taxonomize and compare synthetic data generation algorithms through the components they utilize and the resulting effects on the quality-diversity-complexity (QDC) composition of the data. This analysis extends into a discussion of the importance of balancing QDC in synthetic data for efficient reinforcement learning and self-improvement algorithms. Analogous to the QD trade-offs in training data, there often exist trade-offs between model output quality and output diversity that impact the composition of synthetic data. We observe that many models are currently evaluated and optimized only for output quality, thereby limiting output diversity and the potential for self-improvement. We argue that balancing these trade-offs is essential to the development of future self-improvement algorithms, and we highlight a number of works making progress in this direction.
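The QDC profiling of a synthetic dataset described above can be sketched with simple proxy metrics. This is a minimal illustration, not the paper's actual methodology: the function names, the distinct-n diversity proxy, the length-based complexity proxy, and the pluggable `quality_fn` scorer are all assumptions introduced here for clarity.

```python
def distinct_n(samples, n=2):
    """Diversity proxy (assumed): fraction of unique n-grams across samples.

    Higher values suggest less repetition across the generated data.
    """
    ngrams = []
    for text in samples:
        toks = text.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)


def mean_length(samples):
    """Complexity proxy (assumed): average token count per sample."""
    return sum(len(t.split()) for t in samples) / max(len(samples), 1)


def qdc_profile(samples, quality_fn):
    """Summarize a synthetic dataset along quality, diversity, complexity.

    `quality_fn` is a hypothetical per-sample scorer (e.g. a reward model
    or rubric grader in a real pipeline); here it is any text -> float.
    """
    return {
        "quality": sum(quality_fn(t) for t in samples) / max(len(samples), 1),
        "diversity": distinct_n(samples, n=2),
        "complexity": mean_length(samples),
    }


# Toy usage with a stand-in quality scorer (longer answers score higher).
data = ["the cat sat on the mat", "a dog ran in the park", "the cat sat on the mat"]
profile = qdc_profile(data, quality_fn=lambda t: min(len(t.split()) / 10, 1.0))
```

In a real evaluation these proxies would be replaced by stronger measures (e.g. embedding-based diversity or model-judged quality), but comparing two generation algorithms by such a profile captures the quality-diversity trade-off discussed above: a pipeline that aggressively filters for quality will typically show a lower diversity score on the same budget.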