Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate the returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories, but two key challenges hinder its effectiveness: (1) sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow down convergence; (2) gradient degradation in Group Relative Policy Optimization (GRPO), where identical rewards within a rollout group yield zero advantage, reducing sample efficiency and destabilizing training. To address these challenges, we propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO). PRS is a curriculum-inspired reward design that introduces dense, stage-wise feedback: it encourages models to first master parseable, properly formatted tool calls, and then to optimize for factual correctness and answer quality. We instantiate PRS for short-form QA (with a length-aware BLEU that scores concise answers fairly) and long-form QA (with LLM-as-a-Judge scoring to prevent reward hacking). VSPO is an enhanced GRPO variant that replaces low-value samples with prompts selected by a task-value metric balancing difficulty and uncertainty, and applies value-smoothing clipping to stabilize gradient updates. Experiments on multiple short-form and long-form QA benchmarks show that PRS consistently outperforms traditional binary rewards, and that VSPO achieves greater stability, faster convergence, and higher final performance than PPO, GRPO, CISPO, and SFT-only baselines. Together, PRS and VSPO yield LLM-based TIR agents that generalize better across domains.
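
To make the two ideas concrete, the following is a minimal Python sketch, not the paper's implementation: `progressive_reward` illustrates PRS-style stage-wise credit assignment, and `task_value` illustrates one plausible difficulty/uncertainty balance behind VSPO's prompt selection. All helper names (`is_parseable`, `is_well_formatted`, `answer_quality`) and the stage weights are hypothetical assumptions for illustration.

```python
def progressive_reward(trajectory, reference_answer,
                       is_parseable, is_well_formatted, answer_quality):
    """Dense, stage-wise reward: format first, then answer quality.

    The three callables are hypothetical stand-ins: format checks plus a
    length-aware BLEU (short-form QA) or LLM-as-a-Judge (long-form QA) scorer.
    The specific staging and weights below are assumptions, not the paper's.
    """
    # Stage 1: the model must emit tool calls that can be parsed at all.
    if not is_parseable(trajectory):
        return 0.0
    reward = 0.2  # partial credit for parseable tool calls (assumed weight)

    # Stage 2: tool calls must also follow the required schema/format.
    if not is_well_formatted(trajectory):
        return reward
    reward += 0.2  # additional credit for correct formatting (assumed weight)

    # Stage 3: score the final answer, scaled into the remaining reward budget.
    reward += 0.6 * answer_quality(trajectory, reference_answer)
    return reward


def task_value(success_rate: float) -> float:
    """Toy task-value score peaking at intermediate difficulty/uncertainty.

    `success_rate` is the fraction of rollouts in a group that solve the prompt.
    Near 0 or 1, all rollouts get identical rewards and the GRPO advantage is
    zero, so mid-range prompts carry the most training signal; VSPO-style
    sampling would prefer them (the quadratic form here is an assumption).
    """
    return 4.0 * success_rate * (1.0 - success_rate)
```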