Evaluating preference optimization (PO) algorithms for LLM alignment is challenging: experiments are prohibitively expensive, noisy, and confounded by variables such as model size and hyperparameters. In this work, we show that it is possible to gain insight into the efficacy of PO algorithms on simpler benchmarks. We design a diagnostic suite of MuJoCo tasks and datasets, which we use to systematically evaluate PO algorithms, establishing a cheaper and more controlled benchmark. We then propose a novel family of PO algorithms based on mirror descent, which we call Mirror Preference Optimization (MPO). Using evolutionary strategies, we search this family to discover algorithms specialized to specific properties of preference datasets, such as mixed-quality or noisy data. We demonstrate that the discovered PO algorithms outperform all known algorithms in the targeted MuJoCo settings. Finally, drawing on the insights gained from our MuJoCo experiments, we design a PO algorithm that significantly outperforms existing baselines on an LLM alignment task.
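To make the searchable-family idea concrete, below is a minimal, hypothetical sketch: a DPO-style pairwise preference loss whose link function is parameterized, together with a vanilla evolution-strategies step that could search over those parameters. The softplus-mixture parameterization and the `fitness_fn` interface are illustrative assumptions for exposition, not the paper's actual MPO mirror-map family.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta, link):
    """Pairwise loss on (chosen, rejected) log-probs under policy and reference.

    With link(m) = -logsigmoid(m) this is exactly DPO; substituting other
    convex decreasing links yields other members of a mirror-descent-style
    family (illustrative only, not the paper's exact MPO parameterization).
    """
    # Margin between policy and reference log-ratios, chosen minus rejected.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return link(margin).mean()

def parametrized_link(theta):
    # Hypothetical search space: a non-negative mixture of convex, decreasing
    # components. theta = [1.0, 0.0] recovers DPO, since softplus(-m) equals
    # -logsigmoid(m).
    def link(m):
        base = F.softplus(-m)
        return theta[0] * base + theta[1] * base ** 2
    return link

def es_step(theta, fitness_fn, sigma=0.1, pop=16, lr=0.05):
    """One vanilla evolution-strategies update over the link parameters.

    fitness_fn is assumed to train a small policy with the candidate loss on
    the target preference dataset and return its evaluation score.
    """
    eps = torch.randn(pop, theta.numel())
    scores = torch.tensor([fitness_fn(theta + sigma * e) for e in eps])
    # Normalize fitness and move along the score-weighted perturbations.
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    return theta + lr / (pop * sigma) * (eps * scores[:, None]).sum(0)
```

Under this sketch, each candidate `theta` induces one loss in the family, and the evolutionary search ranks candidates purely by downstream performance on a given dataset, which is what allows the discovered objective to specialize to properties such as label noise.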