Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

As LLM agent systems take on more complex tasks, they increasingly rely on meta-agents: higher-order agents that operate on other agents, much as managers supervise employees. Whatever a meta-agent does, whether coordinating agents, halting risky actions before execution, or repairing failed runs, it must manipulate agentic execution at runtime. Existing agentic substrates make this hard: they give meta-agents only plain transcripts and environment snapshots, forcing each meta-agent to build its own tooling to reconstruct and orchestrate execution state. We therefore introduce Shepherd, a Python substrate grounded in functional programming principles, in which an agent's execution is itself a first-class object that a meta-agent can inspect and transform. Every model call, tool call, and environment change becomes a structured event in a Git-like execution trace, from which any past state can be forked, 5x faster than with docker commit, and replayed. Three example use cases show Shepherd's versatility: (1) a supervisor agent prevents conflicts among parallel coding agents, lifting CooperBench performance from 28.8% to 54.7%; (2) a counterfactual optimizer repairs agent workflows by proposing edits and replaying runs from the point where behavior changes, outperforming MetaHarness on TerminalBench-2 with 58% lower wall-clock time; (3) a meta-agent picks fork points during rollouts to improve credit assignment in long-horizon agentic RL, doubling GRPO's gains on TerminalBench-2. We open-source Shepherd to empower future meta-agents with principled and efficient operations over agentic execution.
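
To make the idea of a Git-like execution trace concrete, the following is a minimal sketch and not Shepherd's actual API. Every name in it (Event, Node, Trace, fork_at, replay, apply_event) is a hypothetical one chosen for illustration. It assumes events are immutable records linked to their parent, so that a fork shares its prefix with the original trace instead of copying it, and that replay is a pure fold of events over an initial state.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """One structured step (hypothetical schema, not Shepherd's)."""
    kind: str      # e.g. "model_call", "tool_call", "env_change"
    payload: Any


@dataclass(frozen=True)
class Node:
    """An event plus a link to its parent, like a commit in Git."""
    event: Event
    parent: Optional[Node]


@dataclass(frozen=True)
class Trace:
    """Immutable trace; appending returns a new Trace sharing history."""
    head: Optional[Node] = None

    def append(self, event: Event) -> Trace:
        return Trace(Node(event, self.head))

    def _nodes(self) -> Tuple[Node, ...]:
        # Walk parent links from head to root, then put oldest first.
        out = []
        node = self.head
        while node is not None:
            out.append(node)
            node = node.parent
        return tuple(reversed(out))

    def events(self) -> Tuple[Event, ...]:
        return tuple(n.event for n in self._nodes())

    def fork_at(self, index: int) -> Trace:
        """Return a trace holding only the first `index` events.

        No events are copied: the fork points at an existing node.
        """
        nodes = self._nodes()
        if not 0 <= index <= len(nodes):
            raise IndexError("fork point outside trace")
        return Trace(nodes[index - 1] if index > 0 else None)


def replay(trace: Trace, step: Callable[[Any, Event], Any], initial: Any) -> Any:
    """Rebuild a state by folding `step` over the events, oldest first."""
    state = initial
    for event in trace.events():
        state = step(state, event)
    return state


def apply_event(files: dict, event: Event) -> dict:
    """Toy environment model: only env_change events modify the files."""
    if event.kind == "env_change":
        path, content = event.payload
        return {**files, path: content}
    return files


main = (
    Trace()
    .append(Event("model_call", "plan edit"))
    .append(Event("env_change", ("a.py", "v1")))
    .append(Event("env_change", ("a.py", "v2")))
)
# Fork just before the last event and take a different action.
branch = main.fork_at(2).append(Event("env_change", ("b.py", "new")))

main_state = replay(main, apply_event, {})
branch_state = replay(branch, apply_event, {})
```

Here main_state reflects the original run and branch_state the counterfactual one, while both traces share the same first two nodes in memory. A real substrate would also need to snapshot and restore the actual environment (files, processes), which this sketch replaces with a toy dictionary.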
