We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases to ensure faithful implementations. We construct SciCoQA from GitHub issues and reproducibility papers, and to scale our dataset, we propose a synthetic data generation method for constructing paper-code discrepancies. We analyze the paper-code discrepancies in detail and propose discrepancy types and categories to better understand the occurring mismatches. In total, our dataset consists of 635 paper-code discrepancies (92 real, 543 synthetic), covering the AI domain from real-world data and extending to Physics, Quantitative Biology, and other computational sciences through synthetic data. Our evaluation of 22 LLMs demonstrates the difficulty of SciCoQA, particularly for instances involving omitted paper details, long-context inputs, and data outside the models' pre-training corpus. The best-performing models in our evaluation, Gemini 3.1 Pro and GPT-5 Mini, detect only 46.7% of real-world paper-code discrepancies.