Reinforcement Learning (RL) is pivotal for enhancing Large Language Model (LLM) reasoning, yet mainstream algorithms such as GRPO and DAPO remain constrained by a coarse-grained credit-assignment paradigm in which all tokens of the same response receive an identical reward. In this paper, we propose Dynamic Entropy Weighting, which systematically defines entropy-based weight ratios $\frac{H_{i,t}}{\sum_{k=1}^{n} H_{k,t}}$ and similar variants to redistribute rewards and obtain fine-grained reward signals. We realize this through two new algorithms: Group Token Policy Optimization (GTPO), which assigns an entropy-weighted reward to each token and synthesizes a token-specific advantage function to drive the model toward optimal reasoning paths, and its sequence-level analogue, Sequence-Level GRPO (GRPO-S), which extends this design to the sequence level and exhibits superior stability on long Chain-of-Thought (CoT) reasoning tasks.
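The entropy-based weight ratio above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the use of per-token policy entropy as the weight, and the scaling by sequence length (so the mean per-token reward equals the original scalar reward) are all assumptions for the sake of the example.

```python
import numpy as np

def entropy_weighted_rewards(token_entropies, sequence_reward):
    """Redistribute a scalar sequence-level reward across tokens in
    proportion to each token's policy entropy (illustrative sketch;
    the length-scaling convention is an assumption)."""
    H = np.asarray(token_entropies, dtype=float)
    weights = H / H.sum()                       # w_t = H_t / sum_k H_k
    # Scale by sequence length so the mean token reward stays equal
    # to the original scalar reward.
    return weights * len(H) * sequence_reward

# High-entropy tokens receive a larger share of the reward.
rewards = entropy_weighted_rewards([1.0, 1.0, 2.0], 1.0)
```

With entropies `[1.0, 1.0, 2.0]` and reward `1.0`, the weights are `[0.25, 0.25, 0.5]`, so the third (highest-entropy) token gets twice the reward of the others while the mean token reward remains `1.0`.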