This note introduces Isometric Policy Optimization (ISOPO), an efficient method that approximates the natural policy gradient in a single gradient step. By contrast, existing proximal policy methods such as GRPO or CISPO take multiple gradient steps with variants of importance-ratio clipping to approximate a natural gradient step relative to a reference policy. In its simplest form, ISOPO normalizes the log-probability gradient of each sequence in the Fisher metric before contracting it with the advantages. Another variant transforms the microbatch advantages using the neural tangent kernel of each layer; ISOPO applies this transformation layer-wise within a single backward pass and can be implemented with negligible computational overhead over vanilla REINFORCE.
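To make the simplest form concrete, here is a minimal PyTorch sketch, not the note's exact recipe: `sequence_log_prob` is a hypothetical interface returning the scalar log-probability of one sequence, and the Fisher-metric norm of each per-sequence gradient is stood in for by its Euclidean norm, which is a simplifying assumption. The sketch loops over sequences for readability; a fused implementation would batch these gradients.

```python
import torch

def isopo_simple_step(policy, sequences, advantages, lr=1e-2, eps=1e-8):
    """Sketch of the simplest ISOPO-style update (assumptions noted above).

    For each sequence: take the gradient of its log-probability, normalize it
    (Euclidean norm as a stand-in for the Fisher-metric norm), weight by the
    advantage, and average the result into a REINFORCE-style ascent step.
    """
    params = [p for p in policy.parameters() if p.requires_grad]
    update = [torch.zeros_like(p) for p in params]

    for seq, adv in zip(sequences, advantages):
        logp = policy.sequence_log_prob(seq)          # hypothetical: scalar log pi_theta(seq)
        grads = torch.autograd.grad(logp, params)     # per-sequence score function
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + eps
        for u, g in zip(update, grads):
            u.add_(adv * g / norm)                    # advantage-weighted, normalized direction

    with torch.no_grad():
        for p, u in zip(params, update):
            p.add_(lr * u / len(sequences))           # averaged ascent step
```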
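For the layer-wise variant, one way to read "transforms the microbatch advantages based on the neural tangent kernel in each layer" is via the push-through identity (JᵀJ + λI)⁻¹Jᵀ = Jᵀ(JJᵀ + λI)⁻¹: with a layer's empirical Fisher built from the per-sequence gradients J, a damped natural-gradient step in that layer equals an ordinary policy-gradient step whose advantages are multiplied by the inverse of the layer's NTK Gram matrix. The sketch below illustrates that reading only; the function name, shapes, and damping term are assumptions, not the note's stated algorithm.

```python
import torch

def ntk_transformed_advantages(per_seq_layer_grads, advantages, damping=1e-3):
    """Transform microbatch advantages with one layer's NTK Gram matrix (sketch).

    per_seq_layer_grads: (B, P) per-sequence gradients of the sequence
    log-probabilities w.r.t. this layer's flattened parameters.
    advantages: (B,) microbatch advantages.
    Returns the transformed advantages and the resulting layer update direction.
    """
    J = per_seq_layer_grads                           # (B, P)
    A = advantages                                    # (B,)
    K = J @ J.T                                       # layer NTK Gram matrix, (B, B)
    K = K + damping * torch.eye(K.shape[0], dtype=K.dtype, device=K.device)
    A_tilde = torch.linalg.solve(K, A)                # advantages transformed by the damped NTK
    layer_update = J.T @ A_tilde                      # (P,) damped natural-gradient direction
    return A_tilde, layer_update

# Toy usage with random data: 4 sequences, a layer with 10 parameters.
J = torch.randn(4, 10)
A = torch.randn(4)
A_tilde, update = ntk_transformed_advantages(J, A)
```

Because the transform only changes the scalar weights attached to each sequence, it can in principle be folded into the same backward pass that computes the policy gradient, which is consistent with the single-backward-pass claim above.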