Recent work suggests that preference-tuning techniques -- such as Reinforcement Learning from Human Feedback (RLHF) methods like PPO and GRPO, as well as alternatives like DPO -- reduce diversity, creating a dilemma given that these models are widely deployed in applications requiring varied outputs. We argue that diversity without consideration of quality has limited practical value. To address this issue, we introduce a framework for measuring effective semantic diversity -- diversity among outputs that meet quality thresholds -- which better reflects the practical utility of large language models (LLMs). Using open-ended tasks that require no human intervention, we find counterintuitive results: when using diversity metrics that do not explicitly consider quality, preference-tuned models -- particularly those trained via RL -- often produce outputs with lower diversity; however, these same preference-tuned models generate greater effective semantic diversity than supervised fine-tuned (SFT) or base models. Our analysis reveals a further trend: while larger models may exhibit greater effective semantic diversity than smaller models, smaller models are consistently more parameter-efficient at producing unique content within a fixed sampling budget. These findings have practical implications for applications that require diverse yet high-quality outputs, from creative assistance to synthetic data generation.
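As a rough illustration of the effective semantic diversity framework described above, the sketch below filters sampled outputs by a quality threshold and then measures diversity only among the outputs that pass. The quality scorer, the sentence-embedding function, the threshold value, and the choice of mean pairwise cosine distance are all illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of "effective semantic diversity": diversity computed only over
# outputs that meet a quality threshold. The quality scorer, embedding model,
# threshold, and mean pairwise cosine distance are illustrative assumptions.
from itertools import combinations
from typing import Callable, List

import numpy as np


def effective_semantic_diversity(
    outputs: List[str],
    quality_fn: Callable[[str], float],      # assumed quality scorer (e.g., a reward model)
    embed_fn: Callable[[str], np.ndarray],   # assumed sentence-embedding function
    quality_threshold: float = 0.5,          # assumed threshold value
) -> float:
    """Mean pairwise cosine distance among outputs whose quality meets the threshold."""
    # Keep only outputs that clear the quality bar.
    qualified = [o for o in outputs if quality_fn(o) >= quality_threshold]
    if len(qualified) < 2:
        return 0.0  # diversity is undefined/zero with fewer than two qualifying outputs

    embeddings = [embed_fn(o) for o in qualified]
    distances = []
    for a, b in combinations(embeddings, 2):
        cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        distances.append(1.0 - cos_sim)
    return float(np.mean(distances))
```

Under this kind of definition, a model whose raw outputs are highly varied but frequently below the quality bar can score lower than a model whose outputs are all acceptable and still semantically distinct, which is the distinction the abstract draws between plain diversity metrics and effective semantic diversity.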