Iterative data generation and model re-training can effectively align large language models (LLMs) with human preferences. The data sampling process is crucial, as it significantly influences the success of policy improvement. Repeated random sampling is a widely used method that independently queries the model multiple times to generate outputs. In this work, we propose a more effective sampling method, named Preference-Guided Reflective Sampling (PRS). Unlike random sampling, PRS employs a tree-based generation framework to enable more efficient sampling. It leverages adaptive self-refinement techniques to better explore the sampling space. By specifying user preferences in natural language, PRS can further optimize response generation according to these preferences. As a result, PRS can align models to diverse user preferences. Our experiments demonstrate that PRS generates higher-quality responses with significantly higher rewards. On AlpacaEval and Arena-Hard, PRS substantially outperforms repeated random sampling in best-of-$N$ sampling. Moreover, PRS shows strong performance when applied in iterative offline RL training.
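The contrast between repeated random sampling and PRS can be sketched in simplified pseudocode-style Python. The `generate` and `reward` functions below are hypothetical stubs standing in for the policy model and reward model (they are not part of the paper); the sketch only illustrates the control-flow difference: independent draws versus a small tree whose children refine earlier responses under a stated preference.

```python
import random

# Hypothetical stubs for the policy and reward models, used only to
# illustrate control flow; a real setup would call an LLM and a trained
# reward model here.
def generate(prompt, feedback=None):
    # A real policy conditions on the prompt (and, for refinement,
    # on a previous response); here we return a random placeholder.
    return f"response-{random.randint(0, 10**6)}"

def reward(prompt, response, preference):
    # Stub score for how well the response matches the preference.
    return random.random()

def best_of_n_random(prompt, preference, n=8):
    """Baseline: n independent samples, keep the highest-reward one."""
    cands = [generate(prompt) for _ in range(n)]
    return max(cands, key=lambda r: reward(prompt, r, preference))

def prs_tree_sketch(prompt, preference, width=2, depth=3):
    """PRS-style sketch: expand a shallow tree level by level, where each
    child is generated by refining a parent response with the preference
    stated in natural language. Keep the best-scoring response seen."""
    guided_prompt = f"{prompt}\nPreference: {preference}"
    best, best_r = None, float("-inf")
    frontier = [None]  # root level: no parent response to refine yet
    for _ in range(depth):
        children = []
        for parent in frontier:
            for _ in range(width):
                resp = generate(guided_prompt, feedback=parent)
                r = reward(prompt, resp, preference)
                if r > best_r:
                    best, best_r = resp, r
                children.append(resp)
        frontier = children[:width]  # narrow beam over refinements
    return best
```

Both routines spend a comparable sampling budget; the difference is that the tree variant reuses earlier responses as refinement context rather than drawing every sample independently.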