Recent studies operationalize self-improvement through coding agents that edit their own codebases. They grow a tree of self-modifications using expansion strategies that favor higher software engineering benchmark performance, assuming that higher performance implies more promising subsequent self-modifications. However, we identify a mismatch between an agent's self-improvement potential (metaproductivity) and its coding benchmark performance, which we call the Metaproductivity-Performance Mismatch. Inspired by Huxley's concept of clade, we propose a metric ($\mathrm{CMP}$) that aggregates the benchmark performances of an agent's descendants as an indicator of its potential for self-improvement. We show that, in our self-improving coding agent development setting, access to the true $\mathrm{CMP}$ is sufficient to simulate how the G\"odel Machine would behave under certain assumptions. We introduce the Huxley-G\"odel Machine (HGM), which searches the tree of self-modifications by estimating $\mathrm{CMP}$ and using it as guidance. On SWE-bench Verified and Polyglot, HGM outperforms prior self-improving coding agent development methods while using less wall-clock time. Moreover, HGM demonstrates strong transfer to other coding datasets and large language models. The agent optimized by HGM on SWE-bench Verified with GPT-5-mini and evaluated on SWE-bench Lite with GPT-5 achieves human-level performance, matching the best officially checked results of human-engineered coding agents. Our code is available at https://github.com/metauto-ai/HGM.
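To make the Metaproductivity-Performance Mismatch concrete, the following is a minimal sketch of a clade-level aggregate over a tree of self-modifications. The abstract does not specify how $\mathrm{CMP}$ aggregates performances or how HGM estimates it, so the plain mean used here, the choice to include the agent itself in its clade, and all names (`AgentNode`, `clade_scores`, `clade_estimate`) and scores are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class AgentNode:
    """One agent in the tree of self-modifications (hypothetical structure)."""
    name: str
    score: float  # benchmark performance in [0, 1], made-up values below
    children: list["AgentNode"] = field(default_factory=list)


def clade_scores(node: AgentNode) -> list[float]:
    """Collect the scores of a node and all of its descendants."""
    scores = [node.score]
    for child in node.children:
        scores.extend(clade_scores(child))
    return scores


def clade_estimate(node: AgentNode) -> float:
    """Assumed aggregate: the mean score over the node's clade."""
    scores = clade_scores(node)
    return sum(scores) / len(scores)


# Agent "a" scores higher itself, but its descendants do worse than those of "b".
a = AgentNode("a", 0.40, [AgentNode("a1", 0.30), AgentNode("a2", 0.35)])
b = AgentNode("b", 0.35, [AgentNode("b1", 0.50), AgentNode("b2", 0.55)])
root = AgentNode("root", 0.30, [a, b])

# Greedy expansion by the agent's own benchmark score versus by the clade aggregate.
best_by_score = max(root.children, key=lambda n: n.score)
best_by_clade = max(root.children, key=clade_estimate)

print("expand by own score:", best_by_score.name)
print("expand by clade aggregate:", best_by_clade.name)
```

In this toy tree the two selection rules disagree: ranking by the agent's own score picks one branch, while ranking by the clade aggregate picks the branch whose descendants perform better, which is the kind of mismatch the abstract describes.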