Given the increasing use of synthetic data in language model (LM) post-training, an LM's ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While prior works have focused on developing effective data generation methods, they lack a systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities. By synthesizing 1.26 million training instances with 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities. First, we observe that LMs exhibit distinct strengths. For instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet performs better at enhancing existing ones. Furthermore, our analysis reveals that an LM's data generation ability does not necessarily correlate with its problem-solving ability. Instead, multiple intrinsic features of data quality, including response quality, perplexity, and instruction difficulty, collectively serve as better indicators. Finally, we demonstrate that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness.