Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate the returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories, but two key challenges hinder its effectiveness: (1) sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence; (2) gradient degradation in Group Relative Policy Optimization (GRPO), where identical rewards within a rollout group yield zero advantages, reduces sample efficiency. To address these challenges, we propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO). PRS is a curriculum-inspired reward design that introduces dense, stage-wise feedback: it first encourages models to master parseable, properly formatted tool calls, and then optimizes for factual correctness and answer quality. We instantiate PRS for short-form QA (with a length-aware BLEU that fairly scores concise answers) and long-form QA (with LLM-as-a-Judge scoring to prevent reward hacking). VSPO is an enhanced GRPO variant that replaces zero-advantage samples with prompts selected by a task-value metric balancing difficulty and uncertainty, and applies value-smoothing clipping to stabilize gradient updates. Experiments on multiple short-form and long-form QA benchmarks show that PRS consistently outperforms traditional binary rewards, and that VSPO achieves superior stability, faster convergence, and higher final performance compared to SFT, PPO, and GRPO baselines. Together, PRS and VSPO yield LLM-based TIR agents that generalize better across domains.
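As a minimal sketch (not taken from the paper; the helper name, group size, and epsilon are illustrative), the zero-advantage degeneracy that motivates VSPO follows directly from the standard group-relative normalization used in GRPO-style updates: when a binary 0-1 verifiable reward is identical across a rollout group, every normalized advantage collapses to zero and that prompt contributes no policy gradient.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the GRPO style: normalize each rollout's
    reward against the mean and standard deviation of its own group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# With a binary 0-1 verifiable reward, a group whose rollouts all succeed
# (or all fail) has identical rewards, so every advantage collapses to zero
# and the prompt yields no gradient signal for that update.
print(group_relative_advantages([1, 1, 1, 1]))  # [0. 0. 0. 0.]
print(group_relative_advantages([0, 0, 0, 0]))  # [0. 0. 0. 0.]
print(group_relative_advantages([1, 0, 1, 0]))  # mixed group: non-zero advantages
```

Under this view, the abstract's VSPO can be read as swapping such zero-advantage prompts for ones its task-value metric deems more informative, so that each update batch retains non-degenerate gradient signal.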