Evaluating preference optimization (PO) algorithms for LLM alignment is challenging: experiments are prohibitively expensive, noisy, and confounded by variables such as model size and hyperparameters. In this work, we show that it is possible to gain insight into the efficacy of PO algorithms on simpler benchmarks. We design a diagnostic suite of MuJoCo tasks and datasets, which we use to systematically evaluate PO algorithms, establishing a cheaper and more controlled benchmark. We then propose a novel family of PO algorithms based on mirror descent, which we call Mirror Preference Optimization (MPO). Using evolutionary strategies, we search this family to discover algorithms specialized to specific properties of preference datasets, such as mixed-quality or noisy data. We demonstrate that the discovered PO algorithms outperform all known algorithms in the targeted MuJoCo settings. Finally, drawing on the insights gained from our MuJoCo experiments, we design a PO algorithm that significantly outperforms existing baselines on an LLM alignment task.
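To make the searchable-family idea concrete, below is a minimal, hypothetical sketch: a DPO-style pairwise preference loss whose link function is parameterized, together with a vanilla evolution-strategies step that could search over those parameters. The softplus-mixture parameterization and the `fitness_fn` interface are illustrative assumptions for exposition, not the paper's actual MPO mirror-map family.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta, link):
    """Pairwise loss on (chosen, rejected) log-probs under policy and reference.

    With link(m) = -logsigmoid(m) this is exactly DPO; substituting other
    convex decreasing links yields other members of a mirror-descent-style
    family (illustrative only, not the paper's exact MPO parameterization).
    """
    # Margin between policy and reference log-ratios, chosen minus rejected.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return link(margin).mean()

def parametrized_link(theta):
    # Hypothetical search space: a non-negative mixture of convex, decreasing
    # components. theta = [1.0, 0.0] recovers DPO, since softplus(-m) equals
    # -logsigmoid(m).
    def link(m):
        base = F.softplus(-m)
        return theta[0] * base + theta[1] * base ** 2
    return link

def es_step(theta, fitness_fn, sigma=0.1, pop=16, lr=0.05):
    """One vanilla evolution-strategies update over the link parameters.

    fitness_fn is assumed to train a small policy with the candidate loss on
    the target preference dataset and return its evaluation score.
    """
    eps = torch.randn(pop, theta.numel())
    scores = torch.tensor([fitness_fn(theta + sigma * e) for e in eps])
    # Normalize fitness and move along the score-weighted perturbations.
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)
    return theta + lr / (pop * sigma) * (eps * scores[:, None]).sum(0)
```

Under this sketch, each candidate `theta` induces one loss in the family, and the evolutionary search ranks candidates purely by downstream performance on a given dataset, which is what allows the discovered objective to specialize to properties such as label noise.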