Given the increasing use of synthetic data in language model (LM) post-training, an LM's ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While prior works have focused on developing effective data generation methods, they lack a systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. By synthesizing 1.26 million training instances with 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities. First, we observe that LMs exhibit distinct strengths. For instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet performs better at enhancing existing ones. Furthermore, our analysis reveals that an LM's data generation ability does not necessarily correlate with its problem-solving ability. Instead, multiple intrinsic features of data quality, including response quality, perplexity, and instruction difficulty, collectively serve as better indicators. Finally, we demonstrate that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness.