Multi-turn tool calling is challenging for Large Language Models (LLMs) because rewards are sparse and exploration is expensive. A common recipe, SFT followed by GRPO, can stall when within-group reward variation is low (e.g., most rollouts in a group receive the same reward, all 0 or all 1), making the group-normalized advantage uninformative and yielding vanishing updates. To address this problem, we propose RC-GRPO (Reward-Conditioned Group Relative Policy Optimization), which treats exploration as a controllable steering problem via discrete reward tokens. We first fine-tune a Reward-Conditioned Trajectory Policy (RCTP) on mixed-quality trajectories with reward-goal special tokens (e.g., <|high_reward|>, <|low_reward|>) injected into the prompts, enabling the model to generate trajectories of distinct quality on demand. During RL, we then sample diverse reward tokens within each GRPO group and condition rollouts on the sampled tokens, increasing within-group diversity and restoring informative advantage estimates. On the Berkeley Function Calling Leaderboard v4 (BFCLv4) multi-turn benchmark, our method consistently outperforms baselines, and with Qwen-2.5-7B-Instruct it even surpasses all closed-source API models.
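The failure mode and the fix above can be sketched in a few lines: group-normalized advantages vanish when all rollouts in a GRPO group share one reward, while conditioning each rollout on a sampled reward token is intended to diversify outcomes within the group. The token names and helper functions below are illustrative assumptions, not the paper's exact implementation.

```python
import random
import statistics

# Hypothetical reward-goal tokens, following the abstract's examples.
REWARD_TOKENS = ["<|high_reward|>", "<|low_reward|>"]

def build_conditioned_prompt(task_prompt: str, token: str) -> str:
    """Inject a reward-goal special token into the prompt, as in RCTP fine-tuning."""
    return f"{token}\n{task_prompt}"

def group_normalized_advantages(rewards):
    """GRPO-style advantage: (r - mean) / std within one rollout group.
    Degenerates to all zeros when every rollout gets the same reward."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # uninformative group -> vanishing update
    return [(r - mean) / std for r in rewards]

# Sample a diverse reward token for each of the G rollouts in a group.
G = 4
tokens = [random.choice(REWARD_TOKENS) for _ in range(G)]
prompts = [build_conditioned_prompt("Call the weather API for Paris.", t) for t in tokens]

# A homogeneous group (e.g., every rollout fails) gives zero advantages:
print(group_normalized_advantages([0, 0, 0, 0]))  # -> [0.0, 0.0, 0.0, 0.0]
# A mixed-outcome group, which reward conditioning aims to induce, is informative:
print(group_normalized_advantages([1, 0, 1, 0]))  # -> [1.0, -1.0, 1.0, -1.0]
```

The sketch only illustrates why within-group diversity matters for the advantage signal; the actual method trains the policy so that the sampled token steers trajectory quality.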