Automating Computational Reproducibility in Social Science: Comparing Prompt-Based and Agent-Based Approaches

Reproducing computational research is often assumed to be as simple as rerunning the original code with the provided data. In practice, missing packages, fragile file paths, version conflicts, or incomplete logic frequently cause analyses to fail, even when materials are shared. This study investigates whether large language models and AI agents can automate the diagnosis and repair of such failures, making computational results easier to reproduce and verify. We evaluate this using a controlled reproducibility testbed built from five fully reproducible R-based social science studies. We injected realistic failures, ranging from simple issues to complex missing logic, and tested two automated repair workflows in clean Docker environments. The first workflow is prompt-based, repeatedly querying language models with structured prompts of varying context, while the second uses agent-based systems that inspect files, modify code, and rerun analyses autonomously. Across prompt-based runs, reproduction success ranged from 31 to 79 percent, with performance strongly influenced by prompt context and error complexity. Complex cases benefited most from additional context. Agent-based workflows performed substantially better, with success rates of 69 to 96 percent across all complexity levels. These results suggest that automated workflows, especially agent-based systems, can significantly reduce manual effort and improve reproduction success across diverse error types. Unlike prior benchmarks, our testbed isolates post-publication repair under controlled failure modes, allowing direct comparison of prompt-based and agent-based approaches.