We present Orthogonalized Policy Optimization (OPO), a unified theoretical account of large language model alignment grounded in a work-dissipation principle. The policy update is characterized as a constrained proximal response that maximizes the external work induced by an alpha-escort sampling field while paying an intrinsic dissipation cost given by a quadratic fluctuation energy in chi-square ratio geometry. This single variational principle admits three equivalent interpretations: (i) a mirror-descent step with a Euclidean mirror map in ratio space, (ii) a Hilbert-space projection via the orthogonal projection theorem in L2(pi_k), and (iii) a linear-response law from near-equilibrium statistical mechanics. That all three reduce to the same closed-form update establishes OPO as the unique quadratic proximal response within ratio geometry. The framework cleanly decouples sampling geometry (alpha) from optimization geometry (mu), yields a constant Hessian and non-saturating linear gradients, and reveals that advantage z-score normalization is not a heuristic but a conservation-law projection. Experiments on mathematical reasoning tasks demonstrate that OPO outperforms GRPO, GSPO, and DAPO while maintaining healthy gradient dynamics throughout training.
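For concreteness, the following is a minimal sketch of the work-dissipation update the abstract describes, written in notation assumed here rather than taken from the paper: r = pi/pi_k is the policy ratio, A the advantage, w_alpha = q_alpha/pi_k the alpha-escort importance weight, and mu the dissipation strength.

```latex
% Minimal sketch (assumed notation): external work minus quadratic
% dissipation in chi-square ratio geometry, with the ratio-normalization
% constraint E_{pi_k}[pi/pi_k] = 1.
\[
  \pi_{k+1}
  = \arg\max_{\pi}\;
    \underbrace{\mathbb{E}_{q_\alpha}\!\Big[\tfrac{\pi}{\pi_k}\,A\Big]}_{\text{external work}}
  \;-\;
    \underbrace{\tfrac{\mu}{2}\,
      \mathbb{E}_{\pi_k}\!\Big[\big(\tfrac{\pi}{\pi_k}-1\big)^{2}\Big]}_{\text{quadratic dissipation}}
  \quad\text{s.t.}\quad
    \mathbb{E}_{\pi_k}\!\Big[\tfrac{\pi}{\pi_k}\Big]=1 .
\]
% Pointwise stationarity with a multiplier \lambda for the constraint gives
% w_\alpha A - \mu\,(r-1) - \lambda = 0, and the constraint fixes
% \lambda = \mathbb{E}_{q_\alpha}[A], yielding the closed-form linear-response update
\[
  \frac{\pi_{k+1}}{\pi_k}
  = 1 + \frac{1}{\mu}\Big(w_\alpha A - \mathbb{E}_{q_\alpha}[A]\Big).
\]
```

Under this sketch the objective is quadratic in the ratio, so the Hessian is the constant mu and the gradient is linear (non-saturating); the mean-centering half of z-score normalization appears as the multiplier enforcing the normalization constraint, which is one way to read the abstract's claim that the normalization is a conservation-law projection rather than a heuristic. This is a reconstruction from the abstract alone, not the paper's own derivation.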