We present an empirical study investigating how specific properties of preference datasets, such as mixed-quality or noisy data, affect the performance of Preference Optimization (PO) algorithms. Our experiments, conducted in MuJoCo environments, reveal several scenarios where state-of-the-art PO methods experience significant drops in performance. To address this issue, we introduce a novel PO framework based on mirror descent, which can recover existing methods like Direct Preference Optimization (DPO) and Odds-Ratio Preference Optimization (ORPO) for specific choices of the mirror map. Within this framework, we employ evolutionary strategies to discover new loss functions capable of handling the identified problematic scenarios. These new loss functions lead to significant performance improvements over DPO and ORPO across several tasks. Additionally, we demonstrate the generalization capability of our approach by applying the discovered loss functions to fine-tuning large language models using mixed-quality data, where they outperform ORPO.
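For reference, the two baselines named above can be written as per-example losses on sequence log-probabilities. The following is a minimal sketch of their commonly published forms, not of the mirror-descent framework or the discovered losses, which the abstract does not specify. The function names, the scalar interface, and the default `beta` are illustrative assumptions; for ORPO only the odds-ratio preference term is shown (the full objective also adds a supervised fine-tuning term), and `logp` is assumed to be a length-normalized log-likelihood strictly below zero.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    logp_* are log-probabilities under the policy being trained,
    ref_logp_* under the frozen reference policy. beta=0.1 is an
    assumed default, not a value taken from the abstract.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def log_odds(logp: float) -> float:
    """log(p / (1 - p)) for p = exp(logp); requires logp < 0."""
    return logp - math.log1p(-math.exp(logp))


def orpo_preference_term(logp_w: float, logp_l: float) -> float:
    """Odds-ratio term of ORPO for one pair (SFT term omitted)."""
    return -log_sigmoid(log_odds(logp_w) - log_odds(logp_l))


if __name__ == "__main__":
    # Policy prefers the chosen response more than the reference does,
    # so the DPO loss falls below log(2).
    print(dpo_loss(-1.0, -2.0, -1.5, -1.5))
    # Chosen response has higher odds than the rejected one.
    print(orpo_preference_term(-0.5, -2.0))
```

Both terms equal log 2 when the policy is indifferent between the two responses and decrease as the chosen response becomes relatively more likely. Neither loss has a term that models label noise or varying data quality in the preference pairs.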