Large language model agents are increasingly expected to perform operational work: calling APIs, manipulating files, assembling workflows, and acting inside enterprise systems. Yet the tool layer on which this execution depends is still commonly treated as either a hand-written integration artifact or a static list of schemas exposed to a model. This paper introduces Tool Forge, a validation-carrying toolchain that converts natural-language capability intent into governed, sandbox-verified, cataloged tool artifacts and exposes those artifacts to agents through a token-efficient routing layer. Tool Forge treats a tool as a capsule containing intent, capability contract, implementation, dependency policy, tests, documentation, runtime validation evidence, lifecycle state, credential bindings, and routing metadata. It also introduces a Router that exposes intent-scoped tool sessions instead of loading full catalog schemas into the model context. We describe the system architecture, validation pipeline, MCP-facing routing model, governance controls, and initial reproducible benchmarks from the open-source implementation. Across 83 Router benchmark cases, Tool Forge Router achieves an aggregate micro-F1 of 0.901 while reducing estimated task-flow tool context by 99.2% relative to naive full-catalog schema exposure. In a 25-case end-to-end generation probe over local-tool tasks, Tool Forge generates 25 of 25 tool bundles, reaches a micro-F1 of 0.940 against deterministic acceptance checks, and passes 23 of 25 live sandbox validations. These results are presented as an initial systems benchmark, not as a state-of-the-art claim. The paper identifies remaining challenges in adversarial routing, broader API grounding, sandbox isolation, and cross-system evaluation.
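To make the two headline metrics concrete, the following is a minimal sketch of how a micro-averaged F1 over routing cases and a relative context reduction could be computed. It assumes each benchmark case is scored as a set of expected tool names against a set of selected tool names; the paper's actual scoring protocol is not given in this abstract, and `RoutingCase`, `micro_f1`, `context_reduction`, and all tool names and token counts below are hypothetical illustrations rather than parts of the Tool Forge implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RoutingCase:
    """One hypothetical benchmark case: the tools that should be exposed vs. those that were."""
    expected: frozenset[str]
    selected: frozenset[str]


def micro_f1(cases: Iterable[RoutingCase]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives over all cases."""
    tp = fp = fn = 0
    for case in cases:
        tp += len(case.expected & case.selected)   # correctly selected tools
        fp += len(case.selected - case.expected)   # selected but not expected
        fn += len(case.expected - case.selected)   # expected but missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def context_reduction(full_catalog_tokens: int, routed_tokens: int) -> float:
    """Fraction of tool-context tokens saved relative to exposing the full catalog."""
    if full_catalog_tokens <= 0:
        raise ValueError("full_catalog_tokens must be positive")
    return 1.0 - routed_tokens / full_catalog_tokens


# Illustrative data only; these are not the paper's benchmark cases.
example_cases = [
    RoutingCase(frozenset({"read_file", "write_file"}), frozenset({"read_file"})),
    RoutingCase(frozenset({"http_get"}), frozenset({"http_get", "send_email"})),
]
example_f1 = micro_f1(example_cases)
example_reduction = context_reduction(full_catalog_tokens=10_000, routed_tokens=80)
```

Because micro-averaging pools counts before computing the score, cases that involve more tools weigh more heavily than they would under a per-case (macro) average.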