Policy-Value Alignment and Robustness in Search-based Multi-Agent Learning - 专知论文

会员服务 ·

0

AlphaZero · 可约的 · 稳健性 · Learning · 可辨认的 ·

2023 年 1 月 27 日

Policy-Value Alignment and Robustness in Search-based Multi-Agent Learning

翻译：基于搜索的多智能体学习中的策略-价值对齐与鲁棒性

Niko A. Grupen,Michael Hanlon,Alexis Hao,Daniel D. Lee,Bart Selman

from arxiv, 9 pages, 5 figures

Large-scale AI systems that combine search and learning have reached super-human levels of performance in game-playing, but have also been shown to fail in surprising ways. The brittleness of such models limits their efficacy and trustworthiness in real-world deployments. In this work, we systematically study one such algorithm, AlphaZero, and identify two phenomena related to the nature of exploration. First, we find evidence of policy-value misalignment -- for many states, AlphaZero's policy and value predictions contradict each other, revealing a tension between accurate move-selection and value estimation in AlphaZero's objective. Further, we find inconsistency within AlphaZero's value function, which causes it to generalize poorly, despite its policy playing an optimal strategy. From these insights we derive VISA-VIS: a novel method that improves policy-value alignment and value robustness in AlphaZero. Experimentally, we show that our method reduces policy-value misalignment by up to 76%, reduces value generalization error by up to 50%, and reduces average value error by up to 55%.

翻译：大规模结合搜索与学习的人工智能系统在游戏对弈中已达到超人水平，但也被发现存在出人意料的失败模式。此类模型的脆弱性限制了其在现实部署中的效能与可信度。本研究系统性地分析了AlphaZero算法，并识别出两种与探索本质相关的现象。首先，我们发现了策略-价值失调的证据——对于许多状态，AlphaZero的策略与价值预测相互矛盾，揭示了其目标函数中精准动作选择与价值估计之间的张力。进一步地，我们发现了AlphaZero价值函数内部的非一致性，这导致其泛化能力较差，尽管其策略能够执行最优方案。基于上述洞见，我们提出了VISA-VIS方法：一种提升AlphaZero策略-价值对齐与价值鲁棒性的新型方法。实验表明，我们的方法可将策略-价值失调降低高达76%，价值泛化误差降低高达50%，平均价值误差降低高达55%。

0

相关内容

AlphaZero

【USC-Aaron Chan博士答辩Slides】可信自然语言处理机器解释的生成与利用, 242页ppt，Generating and Utilizing Machine Explanations for Trustworthy NLP

【USC-Aaron Chan博士答辩Slides】可信自然语言处理机器解释的生成与利用, 242页ppt，Generating and Utilizing Machine Explanations for Trustworthy NLP

专知会员服务

16+阅读 · 2022年3月13日

Linux导论，Introduction to Linux，96页ppt

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

79+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

强化学习三篇论文避免遗忘等

强化学习三篇论文避免遗忘等

CreateAMind

20+阅读 · 2019年5月24日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

vae 相关论文表示学习 1

vae 相关论文表示学习 1

CreateAMind

12+阅读 · 2018年9月6日

【论文推荐】最新六篇强化学习相关论文—Sublinear、机器阅读理解、加速强化学习、对抗性奖励学习、人机交互

【论文推荐】最新六篇强化学习相关论文—Sublinear、机器阅读理解、加速强化学习、对抗性奖励学习、人机交互

专知

17+阅读 · 2018年4月28日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

Progerin/PrelaminA诱发早老症的蛋白质组学研究

国家自然科学基金

1+阅读 · 2015年12月31日

脂肪细胞来源Microparticles介导新生血管生成在2型糖尿病易损斑块形成中的作用

国家自然科学基金

0+阅读 · 2014年12月31日

面向大数据的信息可视化设计方法研究

国家自然科学基金

7+阅读 · 2014年12月31日

长链非编码RNA-VEC1340靶定KLF4在血管内皮细胞损伤中的调控及机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

miR-140促进CYP2J2基因表达对动脉粥样硬化中血管炎症的调控作用及机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于农田环境要素空间变异特征的多尺度网格化精确管理分区研究

国家自然科学基金

0+阅读 · 2013年12月31日

铼离子液体溶液的热力学性质研究

国家自然科学基金

0+阅读 · 2012年12月31日

有限域上的三角和与和集问题

国家自然科学基金

0+阅读 · 2012年12月31日

3'-UTR单核苷酸多态性影响CYP8B1基因表达致胆囊胆固醇结石形成的机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

可压Navier-Stokes方程及相关流体动力学方程研究

国家自然科学基金

0+阅读 · 2008年12月31日

A Unified Framework of Policy Learning for Contextual Bandit with Confounding Bias and Missing Observations

Arxiv

0+阅读 · 2023年3月20日

Symmetric (Optimistic) Natural Policy Gradient for Multi-agent Learning with Parameter Convergence

Arxiv

0+阅读 · 2023年3月20日

Active Exploration for Inverse Reinforcement Learning

Arxiv

0+阅读 · 2023年3月20日

Causal Confusion and Reward Misidentification in Preference-Based Reward Learning

Arxiv

0+阅读 · 2023年3月18日

Learning Local Heuristics for Search-Based Navigation Planning

Learning Local Heuristics for Search-Based Navigation Planning

Arxiv

0+阅读 · 2023年3月16日

Eco-evolutionary Dynamics of Non-episodic Neuroevolution in Large Multi-agent Environments

Arxiv

0+阅读 · 2023年3月16日

Self-Inspection Method of Unmanned Aerial Vehicles in Power Plants Using Deep Q-Network Reinforcement Learning

Arxiv

0+阅读 · 2023年3月16日

Pretraining in Deep Reinforcement Learning: A Survey

Arxiv

21+阅读 · 2022年11月8日

Transformers are Meta-Reinforcement Learners

Arxiv

15+阅读 · 2022年6月14日

Dynamic neighbourhood optimisation for task allocation using multi-agent

Arxiv

102+阅读 · 2022年5月11日

VIP会员

文章信息

相关主题

最新内容

美国从乌克兰无人机战争中学习经验

美国从乌克兰无人机战争中学习经验

专知会员服务

1+阅读 · 今天15:03

ICML 2026 | 面向视觉语言模型的语义鲁棒性认证

ICML 2026 | 面向视觉语言模型的语义鲁棒性认证

专知会员服务

0+阅读 · 今天14:31

综述 | 智能体电子设计自动化：从“交接有效性”重新理解Agentic EDA

综述 | 智能体电子设计自动化：从“交接有效性”重新理解Agentic EDA

专知会员服务

0+阅读 · 今天14:29

深入解读 Palantir AIP：全球最具争议的人工智能平台究竟如何运作

深入解读 Palantir AIP：全球最具争议的人工智能平台究竟如何运作

专知会员服务

12+阅读 · 6月20日

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

ICML 2026 | 多任务贝叶斯上下文学习：让 Transformer 在测试时显式适应新先验

专知会员服务

4+阅读 · 6月19日

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

ACL 2026综述 | 大规模手语数据集：资源、基准与标注标准

专知会员服务

7+阅读 · 6月19日

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

ICML 2026 Spotlight | SmoothSMoE：解析稀疏 MoE 路由不连续

专知会员服务

6+阅读 · 6月18日

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

综述 | 周期表视角下的大模型推理：范式、方法与失败模式

专知会员服务

8+阅读 · 6月18日

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

《廉价自杀式无人机战争的军事战略影响：乌克兰和伊朗案例研究》

专知会员服务

11+阅读 · 6月18日

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

《面向反无人机作战的联邦式可解释射频–光电/红外情报融合：边缘人工智能优化、电子战韧性及分布式监视验证》

专知会员服务

11+阅读 · 6月18日

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

ICML 2026 | FR3D：解耦自车运动的未来动态三维重建世界模型

专知会员服务

7+阅读 · 6月17日

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

【伯克利博士论文】迈向可扩展与自我演进的大语言模型智能体

专知会员服务

12+阅读 · 6月17日

学习数据的几何：形状空间分析数学综述

学习数据的几何：形状空间分析数学综述

专知会员服务

8+阅读 · 6月17日

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

《现代防空系统综述：架构、传感器、拦截器及新兴威胁环境对基础设施受限防御环境的影响》2026最新长综述

专知会员服务

21+阅读 · 6月17日

定向能反无人机系统最新发展动态

定向能反无人机系统最新发展动态

专知会员服务

10+阅读 · 6月17日

相关VIP内容

【USC-Aaron Chan博士答辩Slides】可信自然语言处理机器解释的生成与利用, 242页ppt，Generating and Utilizing Machine Explanations for Trustworthy NLP

【USC-Aaron Chan博士答辩Slides】可信自然语言处理机器解释的生成与利用, 242页ppt，Generating and Utilizing Machine Explanations for Trustworthy NLP

专知会员服务

16+阅读 · 2022年3月13日

Linux导论，Introduction to Linux，96页ppt

Linux导论，Introduction to Linux，96页ppt

专知会员服务

82+阅读 · 2020年7月26日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

Connections between Support Vector Machines, Wasserstein distance and gradient-penalty GANs

专知会员服务

36+阅读 · 2019年10月17日

Stabilizing Transformers for Reinforcement Learning

Stabilizing Transformers for Reinforcement Learning

专知会员服务

60+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

79+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

ICML 2026 | 面向视觉语言模型的语义鲁棒性认证

深入解读 Palantir AIP：全球最具争议的人工智能平台究竟如何运作

美国从乌克兰无人机战争中学习经验

综述 | 智能体电子设计自动化：从“交接有效性”重新理解Agentic EDA

相关资讯

AIART 2022 Call for Papers

AIART 2022 Call for Papers

CCF多媒体专委会

1+阅读 · 2022年2月13日

强化学习三篇论文避免遗忘等

强化学习三篇论文避免遗忘等

CreateAMind

20+阅读 · 2019年5月24日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

vae 相关论文表示学习 1

vae 相关论文表示学习 1

CreateAMind

12+阅读 · 2018年9月6日

【论文推荐】最新六篇强化学习相关论文—Sublinear、机器阅读理解、加速强化学习、对抗性奖励学习、人机交互

【论文推荐】最新六篇强化学习相关论文—Sublinear、机器阅读理解、加速强化学习、对抗性奖励学习、人机交互

专知

17+阅读 · 2018年4月28日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

相关论文

A Unified Framework of Policy Learning for Contextual Bandit with Confounding Bias and Missing Observations

Arxiv

0+阅读 · 2023年3月20日

Symmetric (Optimistic) Natural Policy Gradient for Multi-agent Learning with Parameter Convergence

Arxiv

0+阅读 · 2023年3月20日

Active Exploration for Inverse Reinforcement Learning

Arxiv

0+阅读 · 2023年3月20日

Causal Confusion and Reward Misidentification in Preference-Based Reward Learning

Arxiv

0+阅读 · 2023年3月18日

Learning Local Heuristics for Search-Based Navigation Planning

Learning Local Heuristics for Search-Based Navigation Planning

Arxiv

0+阅读 · 2023年3月16日

Eco-evolutionary Dynamics of Non-episodic Neuroevolution in Large Multi-agent Environments

Arxiv

0+阅读 · 2023年3月16日

Self-Inspection Method of Unmanned Aerial Vehicles in Power Plants Using Deep Q-Network Reinforcement Learning

Arxiv

0+阅读 · 2023年3月16日

Pretraining in Deep Reinforcement Learning: A Survey

Arxiv

21+阅读 · 2022年11月8日

Transformers are Meta-Reinforcement Learners

Arxiv

15+阅读 · 2022年6月14日

Dynamic neighbourhood optimisation for task allocation using multi-agent

Arxiv

102+阅读 · 2022年5月11日

相关基金

Progerin/PrelaminA诱发早老症的蛋白质组学研究

国家自然科学基金

1+阅读 · 2015年12月31日

脂肪细胞来源Microparticles介导新生血管生成在2型糖尿病易损斑块形成中的作用

国家自然科学基金

0+阅读 · 2014年12月31日

面向大数据的信息可视化设计方法研究

国家自然科学基金

7+阅读 · 2014年12月31日

长链非编码RNA-VEC1340靶定KLF4在血管内皮细胞损伤中的调控及机制研究

国家自然科学基金

0+阅读 · 2014年12月31日

miR-140促进CYP2J2基因表达对动脉粥样硬化中血管炎症的调控作用及机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于农田环境要素空间变异特征的多尺度网格化精确管理分区研究

国家自然科学基金

0+阅读 · 2013年12月31日

铼离子液体溶液的热力学性质研究

国家自然科学基金

0+阅读 · 2012年12月31日

有限域上的三角和与和集问题

国家自然科学基金

0+阅读 · 2012年12月31日

3'-UTR单核苷酸多态性影响CYP8B1基因表达致胆囊胆固醇结石形成的机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

可压Navier-Stokes方程及相关流体动力学方程研究

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员