SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows

AI agents increasingly turn past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse can improve efficiency, but it also creates a lifecycle reliability problem: artifacts that succeed once may fail under environment drift, underspecified tasks, or changing task distributions, especially in web automation. We introduce SKILL.nb, a framework for governing reusable agent workflows with evidence-calibrated lifecycle policies. SKILL.nb uses selective formalization: execution evidence decides which workflow steps should become executable code, which should remain guided by natural language, and when those choices should be revised. Workflows are stored as auditable, versioned notebooks that interleave natural-language guidance, multi-language executable cells, validation gates, fallback paths, and multimodal evidence such as outputs, screenshots, and error traces. At runtime, gate-conditioned execution lets each step run its code when its gates validate, or fall back locally when drift invalidates the executable realization. On WebArena-Verified, SKILL.nb achieves 53.7% single-round success, improving over the strongest baseline by 3.9 percentage points. Across three re-executions, it retains 91.7% of initially successful tasks, 15.5 points above the next-best method. Under bounded repair, it recovers 72.9% of subsequent failures while limiting post-repair regressions to 4.2%, compared with 15.0% to 17.0% for persistent baselines. It also leads on the Mind2Web cross-website and cross-domain splits. In a GitLab migration test, SKILL.nb preserves performance when reusing frozen state learned on GitLab 15.7, with frozen-versus-fresh target-version gaps of -1.7 points on GitLab 16.11 and +0.6 points on GitLab 18.9. These results identify lifecycle governance and gate-conditioned execution as reliability axes beyond one-shot task success.
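
To make the gate-conditioned execution described above concrete, the following is a minimal sketch, not the SKILL.nb implementation. Every name in it (`Step`, `run_step`, `run_workflow`, the `selectors` state key, and the toy `#merge-btn` selector) is a hypothetical illustration. It assumes a step carries natural-language guidance, an optional executable realization, a validation gate, and a local fallback, and that the fallback is taken whenever the gate fails or the code raises.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Dict[str, Any]  # hypothetical stand-in for the observed environment state


@dataclass
class Step:
    name: str
    guidance: str                                     # natural-language description
    gate: Callable[[State], bool]                     # validation gate checked before code runs
    fallback: Callable[[State, str], State]           # local fallback driven by the guidance
    code: Optional[Callable[[State], State]] = None   # executable realization, if formalized


def run_step(step: Step, state: State, log: List[Tuple[str, str]]) -> State:
    """Run the step's code if its gate validates; otherwise fall back locally."""
    if step.code is not None and step.gate(state):
        try:
            result = step.code(dict(state))  # work on a copy so a failure leaves state intact
            log.append((step.name, "code"))
            return result
        except Exception as exc:
            # Keep the failure as evidence, then fall through to the fallback path.
            log.append((step.name, "code_error:" + type(exc).__name__))
    result = step.fallback(dict(state), step.guidance)
    log.append((step.name, "fallback"))
    return result


def run_workflow(steps: List[Step], state: State) -> Tuple[State, List[Tuple[str, str]]]:
    """Execute steps in order, recording which path each step took."""
    log: List[Tuple[str, str]] = []
    for step in steps:
        state = run_step(step, state, log)
    return state, log


# Toy example: the formalized step clicks a fixed selector; the gate checks it still exists.
def click_merge(state: State) -> State:
    state["clicked"] = "#merge-btn"
    return state


def guided_click(state: State, guidance: str) -> State:
    # Assumption: a real fallback would hand the guidance to an agent; here we only record it.
    state["clicked"] = "guided:" + guidance
    return state


merge_step = Step(
    name="merge",
    guidance="Click the merge button",
    gate=lambda s: "#merge-btn" in s["selectors"],
    fallback=guided_click,
    code=click_merge,
)
```

Calling `run_workflow([merge_step], {"selectors": ["#merge-btn"]})` takes the code path, while a drifted page such as `{"selectors": ["#merge-button-v2"]}` fails the gate and takes the fallback for that step only. The per-step log is a simple stand-in for the execution evidence that the abstract says drives later formalization decisions.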
