Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and $\tau^3$-airline, SGCD improves over matched GRPO comparators: AppWorld TGC $42.9 \to 45.6$ on test_normal and $24.7 \to 27.0$ on test_challenge, and $\tau^3$-airline pass@1 $0.583 \to 0.602$.
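To make the last step of the pipeline concrete, the following is a minimal sketch of how bounded credit weights might reshape a rollout-level GRPO advantage into token-level advantages. It is an illustration under stated assumptions, not the SGCD implementation: the function names, the centering of weights at 1, and the `scale` and `bound` parameters are all hypothetical, and the per-token divergences are taken as given rather than computed from a teacher and student.

```python
from statistics import mean, pstdev


def has_mixed_outcomes(rewards):
    """True when a sibling group contains both a success and a failure.

    Assumption: binary verifier rewards (1 = success, 0 = failure).
    """
    return 0 < sum(rewards) < len(rewards)


def grpo_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward within its sibling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def credit_weights(divergences, scale=1.0, bound=0.5):
    """Hypothetical bounded credit weights, centered at 1.

    Tokens with above-average divergence get weight > 1, others < 1,
    clipped to [1 - bound, 1 + bound]. In an autodiff framework these
    weights would be wrapped in a stop-gradient (detached); here they
    are plain floats, so no gradient can flow through them.
    """
    mu = mean(divergences)
    lo, hi = 1.0 - bound, 1.0 + bound
    return [min(hi, max(lo, 1.0 + scale * (d - mu))) for d in divergences]


def reshape_token_advantages(rollout_advantage, divergences, scale=1.0, bound=0.5):
    """Spread one rollout-level advantage over tokens using credit weights."""
    weights = credit_weights(divergences, scale, bound)
    return [rollout_advantage * w for w in weights]


if __name__ == "__main__":
    rewards = [1, 0, 1, 0]            # four sibling rollouts, mixed outcomes
    if has_mixed_outcomes(rewards):
        advantages = grpo_advantages(rewards)
        # Made-up per-token divergences for the first rollout.
        token_adv = reshape_token_advantages(advantages[0], [0.0, 0.2, 1.0])
        print(token_adv)
```

Because `bound` is below 1 in this sketch, every weight stays positive, so the sign of each token advantage still matches the verifier outcome for its rollout; the weights only redistribute magnitude across tokens.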