We compare two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), each tasked with autonomously executing a simple end-to-end gravitational-wave data analysis pipeline on shared computing infrastructure without human intervention. The pipeline comprises power spectral density estimation from raw simulated Einstein Telescope noise, geometric template bank generation, matched-filter recovery of 100 binary black hole signal injections, automated results generation, and large language model-assisted production of a manuscript formatted in the style of Physical Review D. Both agents received identical written specifications and identical compute resources. The experiment was run twice: first with unrealistically loud injections, and then with signals rescaled to a physically motivated SNR range. The scientific results converged in both runs. However, the agents differed substantially in behavior and computational cost: Claude Code completed the pipeline in ~3.4 minutes with silent deviations from the specification, while Codex required ~16 minutes across explicit self-correcting restarts, including an unsolicited performance optimization of the matched-filter inner loop. The autonomously generated manuscripts also differed in length, level of detail, and quality. In the second run, a subtle difference in how the agents interpreted the SNR range instruction led to a genuine scientific divergence: Claude Code silently reinterpreted the instruction, while Codex followed the specification literally. We discuss the implications of these behavioral differences for the deployment of agentic AI in scientific computing workflows, including speed versus auditability, silent versus transparent error handling, instruction interpretation, and the critical role of intermediate data representations in multi-model pipelines.