Forecasting when AI systems will become capable of meaningfully accelerating AI research is a central challenge for AI safety. Existing benchmarks measure broad capability growth but may not provide adequate early warning of recursive self-improvement. We propose measuring an AI system's ability to autonomously implement end-to-end machine learning pipelines drawn from past AI research breakthroughs, given only a minimal task description. By providing a concise task description rather than the full prior work as a reference, we hope to better elicit emerging AI research taste. We introduce a proof-of-concept benchmark in which frontier coding agents autonomously implement an AlphaZero-style machine learning pipeline for Connect Four on consumer hardware within a three-hour budget, and we evaluate the resulting game AIs in a round-robin tournament anchored to the Pascal Pons Connect Four solver. Across four agents with eight trials each, we find substantial differentiation: Claude Opus 4.7 won as first mover against Pons in seven of eight trials, a statistically significant improvement over the other agents tested, none of which exceeded two of eight. The task, which no frontier agent could reliably complete when we began development in January 2026, is now near saturation. Our evaluation also surfaced anomalous behavior in GPT-5.4, which consistently used far less of its allocated time budget than the other agents. A follow-up 16-trial probe using shorter, less evaluation-coded prompts substantially increased GPT-5.4's time-budget usage, a result consistent with, but not diagnostic of, sandbagging. Bradley-Terry ratings across probe conditions showed only directional differences, despite significant differences in time-budget usage. We release our data, code, and prompts to support reproduction and extension.
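To make the rating step concrete, the following is a minimal sketch of how Bradley-Terry strengths can be fitted from pairwise game outcomes, using the standard minorization-maximization (Zermelo) update. It is an illustration under stated assumptions, not the released analysis code: the function name `fit_bradley_terry`, the player labels, and every win count below are hypothetical, and draws are simply ignored.

```python
from typing import Dict, Tuple


def fit_bradley_terry(
    wins: Dict[Tuple[str, str], int], iters: int = 200
) -> Dict[str, float]:
    """Fit Bradley-Terry strengths with the MM (Zermelo) update.

    wins[(a, b)] is the number of games a won against b.
    Returns strengths normalized to sum to 1.
    Assumes at least one decisive game has been recorded.
    """
    players = sorted({p for pair in wins for p in pair})
    strength = {p: 1.0 / len(players) for p in players}

    for _ in range(iters):
        updated = {}
        for i in players:
            # Total wins of player i over all opponents.
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            # Sum over opponents of games_played / (p_i + p_j).
            denom = 0.0
            for j in players:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {p: s / norm for p, s in updated.items()}
    return strength


# Hypothetical tournament: two agents plus an anchor solver.
# These counts are invented for illustration only.
example_wins = {
    ("solver", "agent_x"): 6, ("agent_x", "solver"): 2,
    ("solver", "agent_y"): 7, ("agent_y", "solver"): 1,
    ("agent_x", "agent_y"): 5, ("agent_y", "agent_x"): 3,
}
ratings = fit_bradley_terry(example_wins)
for name, value in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
```

Under the model, the probability that player i beats player j is the strength of i divided by the sum of the two strengths, so including a fixed reference player such as a solver puts all other ratings on a common scale. A player with no wins is driven to a strength of zero by this update, which a real analysis would typically handle with a prior or regularization.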