This note introduces Isometric Policy Optimization (ISOPO), an efficient method that approximates the natural policy gradient in a single gradient step. By contrast, existing proximal policy methods such as GRPO or CISPO take multiple gradient steps with variants of importance-ratio clipping to approximate a natural gradient step relative to a reference policy. In its simplest form, ISOPO normalizes the log-probability gradient of each sequence in the Fisher metric before contracting it with the advantages. Another variant transforms the microbatch advantages using the neural tangent kernel of each layer; ISOPO applies this transformation layer-wise within a single backward pass and can be implemented with negligible computational overhead over vanilla REINFORCE.
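To make the simplest form concrete, here is a minimal PyTorch sketch, not the note's exact recipe: `sequence_log_prob` is a hypothetical interface returning the scalar log-probability of one sequence, and the Fisher-metric norm of each per-sequence gradient is stood in for by its Euclidean norm, which is a simplifying assumption. The sketch loops over sequences for readability; a fused implementation would batch these gradients.

```python
import torch

def isopo_simple_step(policy, sequences, advantages, lr=1e-2, eps=1e-8):
    """Sketch of the simplest ISOPO-style update (assumptions noted above).

    For each sequence: take the gradient of its log-probability, normalize it
    (Euclidean norm as a stand-in for the Fisher-metric norm), weight by the
    advantage, and average the result into a REINFORCE-style ascent step.
    """
    params = [p for p in policy.parameters() if p.requires_grad]
    update = [torch.zeros_like(p) for p in params]

    for seq, adv in zip(sequences, advantages):
        logp = policy.sequence_log_prob(seq)          # hypothetical: scalar log pi_theta(seq)
        grads = torch.autograd.grad(logp, params)     # per-sequence score function
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + eps
        for u, g in zip(update, grads):
            u.add_(adv * g / norm)                    # advantage-weighted, normalized direction

    with torch.no_grad():
        for p, u in zip(params, update):
            p.add_(lr * u / len(sequences))           # averaged ascent step
```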
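For the layer-wise variant, one way to read "transforms the microbatch advantages based on the neural tangent kernel in each layer" is via the push-through identity (JᵀJ + λI)⁻¹Jᵀ = Jᵀ(JJᵀ + λI)⁻¹: with a layer's empirical Fisher built from the per-sequence gradients J, a damped natural-gradient step in that layer equals an ordinary policy-gradient step whose advantages are multiplied by the inverse of the layer's NTK Gram matrix. The sketch below illustrates that reading only; the function name, shapes, and damping term are assumptions, not the note's stated algorithm.

```python
import torch

def ntk_transformed_advantages(per_seq_layer_grads, advantages, damping=1e-3):
    """Transform microbatch advantages with one layer's NTK Gram matrix (sketch).

    per_seq_layer_grads: (B, P) per-sequence gradients of the sequence
    log-probabilities w.r.t. this layer's flattened parameters.
    advantages: (B,) microbatch advantages.
    Returns the transformed advantages and the resulting layer update direction.
    """
    J = per_seq_layer_grads                           # (B, P)
    A = advantages                                    # (B,)
    K = J @ J.T                                       # layer NTK Gram matrix, (B, B)
    K = K + damping * torch.eye(K.shape[0], dtype=K.dtype, device=K.device)
    A_tilde = torch.linalg.solve(K, A)                # advantages transformed by the damped NTK
    layer_update = J.T @ A_tilde                      # (P,) damped natural-gradient direction
    return A_tilde, layer_update

# Toy usage with random data: 4 sequences, a layer with 10 parameters.
J = torch.randn(4, 10)
A = torch.randn(4)
A_tilde, update = ntk_transformed_advantages(J, A)
```

Because the transform only changes the scalar weights attached to each sequence, it can in principle be folded into the same backward pass that computes the policy gradient, which is consistent with the single-backward-pass claim above.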