AI Coding Agents Can Reproduce Social Science Findings

Recent anecdotal evidence suggests that AI coding agents can reproduce published findings when provided with the original data and code, yet systematic evaluation across the social sciences remains limited. Existing benchmarks are insufficient: they are either small or conflate agent performance with problems in the reproduction materials themselves, such as code that fails to execute correctly. Here we introduce SocSci-Repro-Bench, a benchmark of 221 tasks spanning four disciplines and 13 substantive domains. It is constructed from studies whose results are either fully reproducible with the available materials or demonstrably non-reproducible due to missing data, which allows us to isolate agents' reproduction capacity. Evaluating two frontier coding agents, Claude Code and Codex, we find that both can reproduce a large share of social science findings, with Claude Code substantially outperforming Codex. These reproduction rates considerably exceed those previously reported for general-purpose LLM-based agents on comparable reproducibility benchmarks. Both agents also perform strongly on a reasoning task that requires identifying the underlying research questions, and additional analyses suggest that the results are not primarily driven by memorization. Providing the original paper PDF alongside the replication materials modestly improves performance but introduces bias on tasks where reproduction is impossible. We also show that agents can be nudged toward confirmatory specification search through subtle prompt framing. Together, these findings suggest that at least some frontier coding agents can serve as reliable executors of computational workflows, while underscoring the need for careful benchmarking and prompt design as AI systems assume larger roles in scientific production.
