We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5's 94 percent while requiring 100 times less compute. Our approach addresses three challenges in practical agent training: transferring tool-calling knowledge from large models at scale, enabling reinforcement learning without costly live tool execution, and learning robustly from noisy cached environments. CacheRL introduces three key innovations. First, a hybrid thinking trajectory pipeline augments agent trajectories with LLM-generated reasoning traces, producing training examples that teach models not only which tools to call but also why. Second, the CacheAgentLoop eliminates live execution costs through a three-tier fuzzy cache while preserving trajectory fidelity using token-level masking. Third, a cache-tier-aware reward dynamically adjusts answer-quality weights to avoid penalizing models for cache-induced limitations. Through iterative supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), CacheRL improves Qwen3-4B-Thinking's validation reward from 0.43 to 0.78. On public agentic tool-calling benchmarks, our model is competitive with frontier models such as GPT-5. Ablation studies show that removing knowledge transfer reduces performance by 41 percent, while cache-aware rewards contribute a 17 percent improvement. Interestingly, reinforcement learning improves training stability but yields limited gains beyond strong supervised fine-tuning, suggesting that data quality and reward design play a more important role than complex optimization methods in building practical small agent models.
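To make the cache and reward ideas concrete, the following is a minimal illustrative sketch, not the CacheRL implementation. The abstract does not define the three cache tiers, the fuzzy-matching rule, the masking scheme, or the reward weights, so everything here is an assumption: the tier names (`exact`, `fuzzy`, `miss`), the string-similarity threshold, the choice to mask non-assistant tokens out of the loss, the numeric answer-quality weights, and all identifiers (`ToolCache`, `loss_mask`, `tier_aware_reward`) are hypothetical.

```python
import json
from difflib import SequenceMatcher

# Hypothetical tier names; the abstract only says "three-tier fuzzy cache".
EXACT, FUZZY, MISS = "exact", "fuzzy", "miss"


def _key(tool, args):
    """Canonical string key for a tool call (sorted JSON arguments)."""
    return f"{tool}:{json.dumps(args, sort_keys=True)}"


class ToolCache:
    """Replays recorded tool results instead of executing tools live."""

    def __init__(self, fuzzy_threshold=0.8):
        self.entries = {}  # key -> (tool name, recorded result)
        self.fuzzy_threshold = fuzzy_threshold  # assumed value

    def put(self, tool, args, result):
        self.entries[_key(tool, args)] = (tool, result)

    def lookup(self, tool, args):
        """Return (tier, result); result is None on a miss."""
        key = _key(tool, args)
        if key in self.entries:
            return EXACT, self.entries[key][1]
        # Fall back to the most similar cached call to the same tool.
        best_score, best_result = 0.0, None
        for cached_key, (cached_tool, result) in self.entries.items():
            if cached_tool != tool:
                continue
            score = SequenceMatcher(None, key, cached_key).ratio()
            if score > best_score:
                best_score, best_result = score, result
        if best_score >= self.fuzzy_threshold:
            return FUZZY, best_result
        return MISS, None


def loss_mask(segments):
    """One possible token-level mask: 1 for model-generated tokens,
    0 for tokens injected from the cache (tool outputs) or the user."""
    mask = []
    for role, token_ids in segments:
        mask.extend([1 if role == "assistant" else 0] * len(token_ids))
    return mask


# Assumed answer-quality weights: trust the final answer less when the
# tool results came from a weaker cache tier.
ANSWER_WEIGHT = {EXACT: 0.5, FUZZY: 0.3, MISS: 0.0}


def tier_aware_reward(process_score, answer_score, tier):
    """Blend process and answer scores with a tier-dependent weight."""
    w = ANSWER_WEIGHT[tier]
    return (1.0 - w) * process_score + w * answer_score
```

Under these assumptions, a rollout whose tool calls all miss the cache is scored on process quality alone, so the model is not penalized for an answer it could not have produced from the cached environment.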