Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) often suffers from policy mode collapse, brittle exploration loops, and distribution drift. This paper introduces Variational Proximal Policy Optimization (\(\textsc{VP}_2\textsc{O}\)), a particle-based variational inference framework that maps policy optimization to Stein Variational Gradient Descent (SVGD) within a Mixture-of-Experts (MoE) architecture. By combining functional kernels over localized expert prototypes with an expert orthogonalization loss, \(\textsc{VP}_2\textsc{O}\) introduces a geometry-based proximal-control mechanism that can reduce reliance on fixed clipping or KL schedules. On a 33B/4B sparse MoE model, \(\textsc{VP}_2\textsc{O}\) improves results across complex reasoning benchmarks, including a \(+\mathbf{179}\) Elo gain on Codeforces and a \(\mathbf{32\%}\) reduction in token count on AIME mathematical reasoning tasks.
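For orientation, the generic SVGD update that the abstract refers to can be written as follows. This is the standard textbook form, not the paper's specific construction: the particles \(\theta_i\), step size \(\epsilon\), target density \(p\), and kernel \(k\) are symbols assumed here for illustration, and the abstract does not specify how the functional kernel over expert prototypes is defined.

\[
\theta_i \leftarrow \theta_i + \epsilon\, \hat{\phi}(\theta_i),
\qquad
\hat{\phi}(\theta) = \frac{1}{n} \sum_{j=1}^{n} \Big[ k(\theta_j, \theta)\, \nabla_{\theta_j} \log p(\theta_j) + \nabla_{\theta_j} k(\theta_j, \theta) \Big]
\]

The first term moves each particle toward regions of high target density, weighted by its kernel similarity to the other particles. The second term is repulsive: it pushes nearby particles apart, which is the property that makes a particle-based formulation a plausible counter to mode collapse.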