Large language model alignment objectives are often presented as a collection of distinct algorithms, such as PPO, DPO, IPO, and their variants, each motivated by a different derivation. In this work, we argue that this diversity obscures a simpler underlying structure. At a fundamental level, alignment objectives involve two independent design choices: (i) how training signals are sampled and weighted, and (ii) how deviations from a reference policy are geometrically penalized. Existing methods typically entangle these choices through a single divergence, most commonly the Kullback-Leibler divergence. We show that this entanglement is not merely a modeling convenience but a source of systematic instability: when the same divergence simultaneously determines sample weighting and optimization curvature, adjusting one aspect, such as exploration strength, inevitably alters the other, such as gradient geometry. This coupling is particularly problematic in preference-based reinforcement learning, where advantage signals are unbounded and high-confidence regimes are common. We propose a simple but structural remedy: we formulate alignment as an orthogonal mirror descent problem in which sampling geometry enters only as a linear driving force, while optimization geometry is determined independently by a mirror map. This perspective leads to a new alignment objective, Orthogonalized Policy Optimization (OPO), obtained by choosing a Euclidean mirror map in likelihood-ratio space. The resulting objective admits a closed-form solution, exhibits linear, non-saturating gradient dynamics within a well-conditioned trust region, and remains fully compatible with standard large language model training pipelines.
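As a schematic illustration of this decoupling, consider a single response with scalar advantage signal $A$ and likelihood ratio $r_\theta = \pi_\theta(y \mid x) / \pi_{\mathrm{ref}}(y \mid x)$, with $\eta$ a step-size or trust-region coefficient (a minimal sketch of one plausible per-sample instantiation, not the full objective). A Euclidean mirror map in ratio space pairs the linear driving force with a quadratic proximity term,
\[
\mathcal{L}(\theta) \;=\; -\,A\, r_\theta \;+\; \frac{1}{2\eta}\,\bigl(r_\theta - 1\bigr)^2 ,
\]
whose minimizer in $r_\theta$ is the closed form $r^\star = 1 + \eta A$, and whose gradient with respect to $r_\theta$, namely $-A + (r_\theta - 1)/\eta$, is linear and does not saturate, in contrast to KL-penalized objectives in which the same divergence governs both the penalty curvature and the sample weighting.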