Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token-level rewards inherent to chain-of-thought (CoT) reasoning. Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that incorporates the Markov likelihood (sequence likelihood) to link group-level rewards with individual tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly improves training stability.
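Since the abstract does not state TEPO's objective in closed form, the following is only a minimal sketch of the mechanism it describes: a group-relative advantage (as in GRPO) is tied to individual tokens through the chain-rule (Markov) factorization of the sequence likelihood. The function name `tepo_style_loss`, the length normalization, and the tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def tepo_style_loss(token_logps, rewards, mask):
    """Hypothetical TEPO-style objective (name and details are assumptions).

    token_logps: (G, T) log-probs of the sampled tokens, one row per rollout
                 in a group of G responses to the same prompt.
    rewards:     (G,) sparse sequence-level rewards.
    mask:        (G, T) 1.0 for real tokens, 0.0 for padding.
    """
    # Group-relative advantage, as in GRPO: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Markov (sequence) likelihood: by the chain rule, the sequence
    # log-likelihood is the sum of per-token log-probabilities.
    seq_logp = (token_logps * mask).sum(dim=-1)

    # Token-level aggregation: the group-level advantage multiplies the
    # sequence log-likelihood, so the sparse sequence reward reaches every
    # token through its own log-prob term (length-normalized here).
    return -(adv * seq_logp / mask.sum(dim=-1)).mean()

# Toy usage: 4 rollouts of 16 tokens each, two of which earned the reward.
token_logps = -torch.rand(4, 16)          # stand-in log-probs of sampled tokens
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
mask = torch.ones(4, 16)
print(tepo_style_loss(token_logps, rewards, mask))
```

In this reading, the per-token gradient weight comes from the factorized sequence likelihood rather than from an undifferentiated entropy bonus, which is the contrast with entropy-regularization baselines that the abstract draws.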