Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning

Reinforcement learning with verifiable rewards has significantly advanced the reasoning capabilities of large language models, yet how to explicitly steer training toward exploration or exploitation remains an open problem. We introduce Token Hidden Reward (THR), a token-level metric that quantifies each token's influence on the likelihood of correct responses under Group Relative Policy Optimization (GRPO). We find that training dynamics are dominated by a small subset of tokens with high absolute THR values. Most interestingly, tokens with positive THR strengthen confidence in correct outputs, thus favoring exploitation, while tokens with negative THR preserve probability mass for alternative outputs, enabling exploration. This insight suggests a natural intervention: a THR-guided reweighting algorithm that modulates GRPO's learning signals to explicitly bias training toward exploitation or exploration. We validate the efficacy of this algorithm on diverse math reasoning benchmarks. By amplifying tokens with positive THR values and down-weighting tokens with negative THR values, our algorithm improves greedy-decoding accuracy, favoring exploitation. The reverse strategy yields consistent gains in Pass@K accuracy, favoring exploration. We further demonstrate that our algorithm integrates seamlessly with other RL objectives such as GSPO and generalizes across architectures including Llama. These findings establish THR as a principled and fine-grained mechanism for dynamically controlling exploration and exploitation in RL-tuned LLMs, providing new tools for targeted fine-tuning in reasoning-intensive applications.
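To make the reweighting idea concrete, the following is a minimal sketch rather than the authors' algorithm. It assumes per-token THR scores are already available (the abstract does not give the THR formula, so none is computed here), and it assumes a simple multiplicative rule with a hypothetical strength parameter `p`: tokens whose THR sign is favored are scaled by `1 + p`, the others by `1 - p`. The function names and the `mode` argument are illustrative. The only standard ingredient is the group-relative advantage, which standardizes each reward within its sampled group.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def thr_token_weights(thr_scores, p=0.2, mode="exploit"):
    """Hypothetical per-token weights derived from the sign of each THR score.

    Assumption: "exploit" up-weights positive-THR tokens and down-weights
    negative-THR tokens; "explore" does the reverse. Zero scores keep weight 1.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    if mode not in ("exploit", "explore"):
        raise ValueError("mode must be 'exploit' or 'explore'")
    up, down = 1.0 + p, 1.0 - p
    if mode == "explore":
        up, down = down, up
    return [up if s > 0 else down if s < 0 else 1.0 for s in thr_scores]


def reweighted_token_advantages(advantage, thr_scores, p=0.2, mode="exploit"):
    """Scale one response's scalar advantage token by token (assumed rule)."""
    weights = thr_token_weights(thr_scores, p, mode)
    return [advantage * w for w in weights]


if __name__ == "__main__":
    # Four sampled responses to one prompt; reward 1 means verified correct.
    advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
    # Made-up THR scores for the three tokens of the first response.
    thr = [0.5, -0.3, 0.0]
    print(reweighted_token_advantages(advantages[0], thr, mode="exploit"))
    print(reweighted_token_advantages(advantages[0], thr, mode="explore"))
```

In this sketch the two modes differ only in which THR sign receives the larger weight, mirroring the abstract's description of one intervention that can be pointed toward either greedy accuracy or Pass@K. How the weights interact with negatively rewarded responses, clipping, or length normalization is left open here, since the abstract does not specify it.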
