Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four That Performs Comparably to an External Solver

Joshua Sherwood, Ben Aybar, Benjamin Kaplan

Forecasting when AI systems will become capable of meaningfully accelerating AI research is a central challenge for AI safety. Existing benchmarks measure broad capability growth, but may not provide sufficient early warning signals for recursive self-improvement. We propose measuring AI's capability to autonomously implement end-to-end machine learning pipelines from past AI research breakthroughs, given a minimal task description. By providing a concise task description instead of the full prior work as reference, we hope to better elicit emerging AI research taste. We introduce a proof-of-concept benchmark in which frontier coding agents autonomously implement an AlphaZero-style machine learning pipeline for Connect Four on consumer hardware within a three-hour budget, and we evaluate the resulting game AIs in a round-robin tournament anchored to the Pascal Pons Connect Four solver. Across four agents with eight trials each, we find substantial differentiation: Claude Opus 4.7 won as first mover against Pons in seven of eight trials, statistically significantly better than the other agents tested, none of which exceeded two of eight. The task, which no frontier agent could reliably complete when we began development in January 2026, is now near saturation. Our evaluation also surfaced anomalous behavior in GPT-5.4, which consistently used far less of its allocated time budget than other agents. A follow-up 16-trial probe using shorter, less evaluation-coded prompts substantially increased GPT-5.4's time-budget usage, consistent with but not diagnostic of sandbagging; Bradley-Terry ratings across probe conditions showed only directional differences, despite significant differences in time-budget usage. We release our data, code, and prompts to support reproduction and extension.
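For readers unfamiliar with Bradley-Terry ratings, the following is a minimal sketch of how such ratings can be fit from round-robin win counts. It uses the standard minorization-maximization update for the Bradley-Terry model. The abstract does not say how the authors fit their ratings, so this should not be read as their procedure. The function names `fit_bradley_terry` and `win_probability` and the win matrix are hypothetical and chosen for illustration; the numbers are toy data, not results from the paper.

```python
def fit_bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths from a square win-count matrix.

    wins[i][j] is the number of games player i won against player j.
    Returns strengths normalized to have mean 1.
    Assumes every player has at least one win; a player with zero wins
    would be driven to strength 0 by this update.
    """
    n = len(wins)
    strength = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Games played against each opponent, weighted by current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strength[i])
        mean = sum(updated) / n
        strength = [s / mean for s in updated]
    return strength


def win_probability(s_i, s_j):
    """Model probability that a player of strength s_i beats one of strength s_j."""
    return s_i / (s_i + s_j)


if __name__ == "__main__":
    # Toy round-robin among three hypothetical game AIs (illustrative only).
    toy_wins = [
        [0, 6, 7],
        [2, 0, 5],
        [1, 3, 0],
    ]
    strengths = fit_bradley_terry(toy_wins)
    for idx, s in enumerate(strengths):
        print(f"player {idx}: strength {s:.3f}")
    print("P(player 0 beats player 2):", win_probability(strengths[0], strengths[2]))
```

Under this model, only ratios of strengths matter, so the normalization to mean 1 is a convention. Small differences in fitted strength across conditions with few games per pairing can be within noise, which is in keeping with the abstract's description of the probe differences as only directional.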
