Preference learning is a widely adopted post-training technique that aligns large language models (LLMs) with human preferences and improves performance on specific downstream tasks. In this work, we systematically investigate how specific attributes of preference datasets affect the alignment and downstream performance of LLMs on instruction-following tasks. We use a novel synthetic data generation pipeline to create 48,000 unique instruction-following prompts built from combinations of 23 verifiable constraints, which enable fine-grained, automated quality assessment of model responses. With these synthetic prompts, we apply two preference dataset curation methods, rejection sampling (RS) and Monte Carlo Tree Search (MCTS), to obtain pairs of (chosen, rejected) responses. We then run experiments investigating the effects of (1) the presence of shared prefixes between the chosen and rejected responses, (2) the contrast and quality of the chosen and rejected responses, and (3) the complexity of the training prompts. Our experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements and greater stability across challenging training configurations. High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance by balancing diversity and learning efficiency. Additionally, training on prompts of moderate difficulty leads to better generalization across tasks, even for more complex evaluation scenarios, than training on overly challenging prompts. Our findings provide actionable insights into optimizing preference data curation for instruction-following tasks, offering a scalable and effective framework for enhancing LLM training and alignment.
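To make the curation step concrete, the following is a minimal Python sketch of rejection-sampling-style preference pair construction against verifiable constraints. It is an illustration under assumed interfaces, not the authors' implementation: the generate callable, the example constraints, and the min_contrast threshold are hypothetical placeholders.

    from typing import Callable, List, Optional, Tuple


    def constraint_score(response: str, constraints: List[Callable[[str], bool]]) -> float:
        """Fraction of verifiable constraints that the response satisfies."""
        return sum(check(response) for check in constraints) / len(constraints)


    def rejection_sample_pair(
        prompt: str,
        constraints: List[Callable[[str], bool]],
        generate: Callable[[str], str],  # hypothetical call that samples one LLM response
        num_samples: int = 8,
        min_contrast: float = 0.25,      # assumed minimum score gap between chosen and rejected
    ) -> Optional[Tuple[str, str]]:
        """Sample candidate responses, rank them by constraint satisfaction, and return a
        (chosen, rejected) pair, or None if the pair is not contrastive enough."""
        candidates = [generate(prompt) for _ in range(num_samples)]
        ranked = sorted(candidates, key=lambda r: constraint_score(r, constraints))
        chosen, rejected = ranked[-1], ranked[0]
        gap = constraint_score(chosen, constraints) - constraint_score(rejected, constraints)
        return (chosen, rejected) if gap >= min_contrast else None


    # Hypothetical verifiable constraints for one instruction-following prompt;
    # the paper's 23 constraints are defined in its own pipeline.
    example_constraints: List[Callable[[str], bool]] = [
        lambda r: len(r.split()) <= 100,         # word-count limit
        lambda r: r.lower().startswith("sure"),  # required opening phrase
        lambda r: "- " in r,                     # must contain a bulleted list item
    ]

MCTS-based curation differs mainly in how candidates are produced: chosen and rejected responses branch from a shared partial response, so the resulting pair shares a prefix, which the abstract reports yields marginal but more stable gains.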