Group Relative Policy Optimization (GRPO) substantially improves the reasoning performance of Large Language Models (LLMs). However, this success relies heavily on expensive external verifiers or hand-crafted rules. Such dependency not only incurs significant computational cost and training latency, but also yields sparse rewards that hinder optimization efficiency. To address these challenges, we propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent-space geometry. Crucially, our empirical analysis reveals a compelling geometric property: the terminal-token representations of correct reasoning trajectories form dense clusters with high intra-class similarity, whereas those of incorrect trajectories remain scattered as outliers. Building on this finding, we introduce the Iterative Robust Centroid Estimation (IRCE) algorithm, which produces dense, continuous rewards by mitigating magnitude fluctuations via spherical projection and estimating a robust ``truth centroid'' through iterative aggregation. Experimental results on multiple datasets show that our method maintains model performance while achieving over 2x training speedup compared to baselines. Furthermore, extensive results demonstrate strong generalization and robustness. The code will be released soon.
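The IRCE idea described above can be sketched as follows: project terminal-token representations onto the unit sphere to remove magnitude fluctuations, then iteratively re-estimate a centroid from the most concentrated trajectories so that scattered outliers do not drag it away, and finally score each trajectory by its similarity to that centroid. This is a minimal illustrative sketch, not the paper's implementation; the function name `irce_rewards` and the hyperparameters `n_iters` and `trim_frac` are assumptions for illustration.

```python
import numpy as np

def irce_rewards(hidden_states, n_iters=5, trim_frac=0.25):
    """Illustrative IRCE-style intrinsic reward (not the official code).

    hidden_states: (G, d) array of terminal-token representations for a
    group of G sampled trajectories. Returns one scalar reward per
    trajectory: its cosine similarity to a robustly estimated centroid.
    """
    # Spherical projection: normalize away magnitude fluctuations.
    z = hidden_states / np.linalg.norm(hidden_states, axis=1, keepdims=True)

    # Initialize the centroid as the mean direction of the group.
    c = z.mean(axis=0)
    c /= np.linalg.norm(c)

    for _ in range(n_iters):
        sims = z @ c  # cosine similarity of each trajectory to the centroid
        # Robust aggregation: re-estimate the centroid from the most
        # concentrated trajectories, trimming the farthest trim_frac.
        k = max(1, int(round((1 - trim_frac) * len(z))))
        inliers = z[np.argsort(sims)[-k:]]
        c = inliers.mean(axis=0)
        c /= np.linalg.norm(c)

    # Dense, continuous rewards: similarity to the "truth centroid".
    return z @ c
```

Because correct trajectories cluster tightly on the sphere while incorrect ones scatter, the trimmed re-estimation pulls the centroid toward the dense cluster, so clustered trajectories receive high rewards and outliers receive low ones, with no external verifier in the loop.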