Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist until the next human-driven update ships a fix. Self-evolving agents have emerged in response, but all confine evolution to text-mutable artifacts -- skill files, prompt configurations, memory schemas, workflow graphs -- and leave the agent harness untouched. Since routing, hook ordering, state invariants, and dispatch live in code rather than in any text artifact, an entire class of structural failure is physically unreachable from the text layer. We argue that source-level adaptation is a fundamentally more general medium: it is Turing-complete, a strict superset of every text-mutable scope, takes effect deterministically rather than through base-model compliance, and does not erode under long-context drift. We present MOSS, a system that performs self-rewriting at the source level on production agentic substrates. Each evolution is anchored to an automatically curated batch of production-failure evidence and proceeds through a deterministic multi-stage pipeline; code modification is delegated to a pluggable external coding-agent CLI while MOSS retains stage ordering and verdicts. Candidates are verified by replaying the batch against the candidate image in ephemeral trial workers, then promoted via user-consent-gated, in-place container swap with health-probe-gated rollback. On OpenClaw, MOSS lifts a four-task mean grader score from 0.25 to 0.61 in a single cycle without human intervention.
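The cycle described above (evidence batch, delegated code modification, replay verification, consent-gated swap, health-probe-gated rollback) can be sketched as a short driver. This is a minimal illustration under stated assumptions, not the MOSS implementation: every name here (`FailureCase`, `CycleResult`, `run_cycle`, and the four callables) is hypothetical, images are represented as plain strings, and the acceptance rule (candidate must strictly beat the baseline mean grader score on the batch) is an assumed stand-in for whatever verdict logic MOSS actually applies.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class FailureCase:
    """One curated production failure (hypothetical shape)."""
    task_id: str
    transcript: str


@dataclass(frozen=True)
class CycleResult:
    verdict: str            # "rejected", "declined", "rolled_back", or "promoted"
    baseline_score: float   # mean grader score of the current image on the batch
    candidate_score: float  # mean grader score of the candidate image on the batch
    active_image: str       # image serving traffic after the cycle


def mean_score(image: str, batch: Sequence[FailureCase],
               grade: Callable[[str, FailureCase], float]) -> float:
    """Replay every case in the batch against an image and average the grades."""
    return sum(grade(image, case) for case in batch) / len(batch)


def run_cycle(
    current_image: str,
    batch: Sequence[FailureCase],
    build_candidate: Callable[[str, Sequence[FailureCase]], str],
    grade: Callable[[str, FailureCase], float],
    user_consents: Callable[[str], bool],
    health_ok: Callable[[str], bool],
) -> CycleResult:
    """Run one evolution cycle with a fixed stage order.

    The driver owns stage ordering and verdicts; `build_candidate` stands in
    for the pluggable external coding agent and only produces a candidate.
    """
    if not batch:
        raise ValueError("an evolution cycle needs a non-empty evidence batch")

    # Stage 1: delegate the source-level modification.
    candidate = build_candidate(current_image, batch)

    # Stage 2: verify by replaying the same batch against both images.
    baseline = mean_score(current_image, batch, grade)
    score = mean_score(candidate, batch, grade)
    if score <= baseline:  # assumed acceptance rule
        return CycleResult("rejected", baseline, score, current_image)

    # Stage 3: promotion is gated on explicit user consent.
    if not user_consents(candidate):
        return CycleResult("declined", baseline, score, current_image)

    # Stage 4: swap in place, then probe health; roll back on failure.
    active = candidate
    if not health_ok(active):
        active = current_image
        return CycleResult("rolled_back", baseline, score, active)

    return CycleResult("promoted", baseline, score, active)
```

Keeping the verdicts in the driver rather than in `build_candidate` mirrors the separation the abstract describes: the coding agent can be swapped out without changing what counts as a verified or promoted candidate.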