Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging: reinforcement learning matches the prompt-only inference setting but often suffers from sparse rewards and weak credit assignment, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To address this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than as rollout-time hints. PACT keeps rollout generation prompt-only and uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.
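The abstract names three training-time signals (a trace-conditioned RL surrogate, an SFT loss with annealed strength, and prompt-only anchoring) but does not say how they are combined. As a minimal sketch only, the snippet below assumes they are summed, with the SFT weight decaying linearly over training. The additive form, the linear schedule, and the names `annealed_weight`, `combined_loss`, and `anchor_coef` are all illustrative assumptions, not the authors' formulation.

```python
def annealed_weight(step: int, total_steps: int,
                    start: float = 1.0, end: float = 0.0) -> float:
    """Linearly interpolate the SFT weight from `start` to `end`.

    The linear schedule is an assumption; the abstract only says the
    SFT supervision is applied "with annealed strength".
    """
    # Clamp training progress to [0, 1] so out-of-range steps are safe.
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac


def combined_loss(rl_surrogate: float, sft_loss: float, anchor_loss: float,
                  step: int, total_steps: int,
                  anchor_coef: float = 0.1) -> float:
    """Hypothetical additive combination of the three signals.

    rl_surrogate: trace-conditioned RL surrogate on prompt-only rollouts
    sft_loss:     component-aware SFT loss on reasoning prefixes and tool calls
    anchor_loss:  prompt-only anchoring term
    """
    sft_weight = annealed_weight(step, total_steps)
    return rl_surrogate + sft_weight * sft_loss + anchor_coef * anchor_loss
```

Under these assumptions the SFT term dominates early in training and fades out as `step` approaches `total_steps`, leaving the RL surrogate and the anchoring term to drive later updates.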