First head-to-head comparison of agentic AI applied to the analysis of simulated data of the Einstein Telescope

We report a comparison of two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously executing a simple end-to-end gravitational wave data analysis pipeline on a shared computing infrastructure without human intervention. The pipeline comprises power spectral density estimation from raw Einstein Telescope simulated noise, geometric template bank generation, matched filter recovery of 100 binary black hole signal injections, automated results generation, and large language model-assisted production of a manuscript formatted in the style of Physical Review D. Both agents received identical written specifications and identical compute resources. The experiment was run twice: a first run with unrealistically loud injections, and a second run with signals rescaled to a physically motivated SNR range. The scientific results converged in both runs. However, the agents exhibited substantially different behaviors and computational costs: Claude Code completed the pipeline in ~3.4 minutes with silent deviations from the specification, while Codex required ~16 minutes across explicit self-correcting restarts, including an unsolicited performance optimization of the matched filter inner loop. The autonomously generated manuscripts also diverged in length, details, and quality. In the second run, a subtle difference in the interpretation of the SNR range instruction led to a genuine scientific divergence: Claude Code silently reinterpreted the instructions, while Codex followed the specification literally. We discuss the implications of these behavioral differences, such as speed versus auditability, silent versus transparent error handling, instruction interpretation, and the criticality of intermediate data representations in multi-model pipelines, for the deployment of agentic AI in scientific computing workflows.
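The abstract names the core numerical steps (power spectral density estimation, then matched-filter recovery of injected signals) without showing them. The following is a minimal sketch of those two steps, not the pipeline either agent actually produced. It rests on several assumptions that are not in the source: white Gaussian noise stands in for Einstein Telescope noise, a sine-Gaussian stands in for a binary black hole waveform, a single template replaces the template bank, and the function names `welch_psd`, `matched_filter_snr`, and `demo` are hypothetical.

```python
import numpy as np


def welch_psd(noise, fs, seg_len):
    """One-sided PSD from averaged, Hann-windowed, non-overlapping periodograms."""
    n_seg = len(noise) // seg_len
    window = np.hanning(seg_len)
    segs = noise[: n_seg * seg_len].reshape(n_seg, seg_len) * window
    power = np.mean(np.abs(np.fft.rfft(segs, axis=1)) ** 2, axis=0)
    psd = 2.0 * power / (fs * np.sum(window ** 2))
    return np.fft.rfftfreq(seg_len, d=1.0 / fs), psd


def matched_filter_snr(data, template, psd_freqs, psd, fs):
    """Return the SNR time series z(t)/sigma and the template norm sigma."""
    n = len(data)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    s_n = np.interp(freqs, psd_freqs, psd)   # PSD on the data's frequency grid
    d_f = np.fft.rfft(data) / fs             # approximate continuous Fourier transforms
    h_f = np.fft.rfft(template) / fs
    df = fs / n
    weight = np.zeros_like(s_n)
    weight[1:] = 1.0 / s_n[1:]               # inverse-PSD weighting, DC bin dropped
    # sigma^2 = 4 * sum |h(f)|^2 / S(f) * df
    sigma = np.sqrt(4.0 * df * np.sum(np.abs(h_f) ** 2 * weight))
    # z(t) = 4 Re sum d(f) h*(f) / S(f) exp(2 pi i f t) df, evaluated with one inverse FFT
    z = 2.0 * fs * np.fft.irfft(d_f * np.conj(h_f) * weight, n=n)
    return z / sigma, sigma


def demo(seed=0, fs=1024, target_snr=20.0):
    """Inject one toy signal at a known time and recover it (all values illustrative)."""
    rng = np.random.default_rng(seed)
    # Estimate the PSD from a separate, signal-free noise stretch.
    psd_freqs, psd = welch_psd(rng.normal(size=64 * fs), fs, seg_len=fs)
    n = 8 * fs
    t = np.arange(n) / fs
    template = np.exp(-((t - 1.0) / 0.05) ** 2) * np.sin(2 * np.pi * 100.0 * (t - 1.0))
    # Scale the injection with the known white-noise PSD (2 * variance / fs).
    true_psd = np.full_like(psd, 2.0 / fs)
    _, sigma = matched_filter_snr(template, template, psd_freqs, true_psd, fs)
    shift = 3 * fs
    data = rng.normal(size=n) + (target_snr / sigma) * np.roll(template, shift)
    snr, _ = matched_filter_snr(data, template, psd_freqs, psd, fs)
    peak = int(np.argmax(snr))
    return peak, float(snr[peak]), shift


if __name__ == "__main__":
    peak, rho, shift = demo()
    print(f"injected at sample {shift}, recovered at {peak} with SNR {rho:.1f}")
```

The injection amplitude here is set from a target optimal SNR, which is one possible reading of an instruction to rescale signals to an SNR range; the recovered peak SNR fluctuates around that target because of the noise realization and the finite-sample PSD estimate.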
