Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy their schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that then fail at tool resolution, parameter validation, dependency tracking, or execution. We argue that small-corpus distillation handles this failure mode poorly: a few hundred teacher traces can teach workflow format, but they rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats tool use by compact LMs as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO trained on the same search-mined data match zero-shot performance, underperform it, or collapse below it; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable when teacher-trace budgets are scarce.
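To make the idea of repairing a workflow through structured edits concrete, the following is a minimal sketch and not the Evoflux implementation. Everything in it is an assumption made for illustration: the toy `CATALOG`, the `Step` representation, the `$k` reference convention, and the edit rules are hypothetical. A static validator stands in for real execution feedback, and exact-duplicate removal stands in for diversity pruning. Adaptive intensity and meta-guided redesign are omitted.

```python
import difflib
import random
from dataclasses import dataclass

# Hypothetical toy catalog (assumption): tool name -> required parameter names.
CATALOG = {
    "search_papers": {"query"},
    "fetch_abstract": {"paper_id"},
    "summarize": {"text"},
}

@dataclass(frozen=True)
class Step:
    tool: str
    args: tuple  # sorted (param, value) pairs; "$k" refers to the output of step k

def validate(plan):
    """Return a list of (step_index, kind, detail) errors; empty means feasible."""
    errors = []
    for i, step in enumerate(plan):
        if step.tool not in CATALOG:
            errors.append((i, "unknown_tool", step.tool))
            continue  # parameters cannot be checked without a resolved tool
        given = dict(step.args)
        for missing in sorted(CATALOG[step.tool] - given.keys()):
            errors.append((i, "missing_param", missing))
        for extra in sorted(given.keys() - CATALOG[step.tool]):
            errors.append((i, "extra_param", extra))
        for param, value in step.args:
            if isinstance(value, str) and value.startswith("$"):
                # A reference must name an earlier step.
                if not value[1:].isdigit() or int(value[1:]) >= i:
                    errors.append((i, "bad_ref", param))
    return errors

def mutate(plan, rng):
    """Apply one structured edit aimed at a randomly chosen validation error."""
    errors = validate(plan)
    if not errors:
        return plan
    i, kind, detail = rng.choice(errors)
    tool, args = plan[i].tool, dict(plan[i].args)
    if kind == "unknown_tool":
        # Re-resolve against the catalog: closest name, else a random tool.
        close = difflib.get_close_matches(tool, sorted(CATALOG), n=1)
        tool = close[0] if close else rng.choice(sorted(CATALOG))
    elif kind in ("missing_param", "bad_ref"):
        # Wire the parameter to the previous step's output when one exists.
        args[detail] = f"${i - 1}" if i > 0 else "PLACEHOLDER"
    elif kind == "extra_param":
        del args[detail]
    new_step = Step(tool, tuple(sorted(args.items())))
    return plan[:i] + (new_step,) + plan[i + 1:]

def evolve(seed_plan, generations=10, pop_size=4, seed=0):
    """Keep the plans with the fewest validation errors across generations."""
    rng = random.Random(seed)
    population = [seed_plan]
    for _ in range(generations):
        children = [mutate(p, rng) for p in population for _ in range(2)]
        # Simplified diversity pruning: drop exact duplicates, keep order.
        unique = list(dict.fromkeys(population + children))
        unique.sort(key=lambda p: len(validate(p)))
        population = unique[:pop_size]
        if not validate(population[0]):
            break
    return population[0]

# A plausible-looking but infeasible plan: misspelled tool name,
# wrong parameter name, and a reference to a step that does not exist.
SEED_PLAN = (
    Step("search_paper", (("query", "tool agents"),)),
    Step("fetch_abstract", (("id", "$0"),)),
    Step("summarize", (("text", "$5"),)),
)
```

Calling `evolve(SEED_PLAN)` returns a plan for which `validate` reports no errors. In a real system the validator would be replaced by tool resolution and execution against live MCP servers, so a plan that passes the static checks here could still fail at run time.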