Recent advances in diffusion-based reinforcement learning (RL) methods have demonstrated promising results on a wide range of continuous control tasks. However, existing works in this field focus on the application of diffusion policies while leaving diffusion critics unexplored. In fact, since policy optimization fundamentally relies on the critic, accurate value estimation is far more important than policy expressiveness. Furthermore, given the stochasticity of most reinforcement learning tasks, prior work has shown that the critic is more appropriately modeled distributionally. Motivated by these observations, we propose a novel distributional RL method with Diffusion Bridge Critics (DBC). DBC directly models the inverse cumulative distribution function (CDF) of the Q value, allowing it to accurately capture the value distribution; the strong distribution-matching capability of the diffusion bridge prevents the learned distribution from collapsing into a trivial Gaussian. We further derive an analytic integral formula that addresses discretization errors in DBC, which is essential for accurate value estimation. To our knowledge, DBC is the first work to employ the diffusion bridge model as the critic. Notably, DBC is also a plug-and-play component and can be integrated into most existing RL frameworks. Experimental results on MuJoCo robot control benchmarks demonstrate the superiority of DBC over previous distributional critic models.