FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification

Peer review in machine learning is under growing pressure from rising submission volume and limited reviewer time. Most LLM-based reviewing systems read only the manuscript and generate comments from the paper's own narrative. This makes their outputs sensitive to presentation quality and leaves them weak when the evidence needed for review lies in related work or released code. We present FactReview, an evidence-grounded reviewing system that combines claim extraction, literature positioning, and execution-based claim verification. Given a submission, FactReview identifies major claims and reported results, retrieves nearby work to clarify the paper's technical position, and, when code is available, executes the released repository under bounded budgets to test central empirical claims. It then produces a concise review and an evidence report that assigns each major claim one of five labels: Supported, Supported by the paper, Partially supported, In conflict, or Inconclusive. In a case study on CompGCN, FactReview reproduces results that closely match those reported for link prediction and node classification, yet also shows that the paper's broader performance claim across tasks is not fully sustained: on MUTAG graph classification, the reproduced result is 88.4%, whereas the strongest baseline reported in the paper remains 92.6%. The claim is therefore only partially supported. More broadly, this case suggests that AI is most useful in peer review not as a final decision-maker, but as a tool for gathering evidence and helping reviewers produce more evidence-grounded assessments. The code is public at https://github.com/DEFENSE-SEU/Review-Assistant.
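The five-label evidence report can be illustrated with a minimal sketch. The aggregation rule below is an assumption made for illustration, not the decision procedure FactReview actually uses; the names `Check` and `assign_label` are hypothetical. It treats a broad claim as a set of per-task checks and derives one label from them, using only the MUTAG figures quoted in the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Label(Enum):
    # The five labels named in the abstract.
    SUPPORTED = "Supported"
    SUPPORTED_BY_PAPER = "Supported by the paper"
    PARTIALLY_SUPPORTED = "Partially supported"
    IN_CONFLICT = "In conflict"
    INCONCLUSIVE = "Inconclusive"


@dataclass
class Check:
    """One piece of evidence for a claim (hypothetical structure)."""
    task: str
    holds: Optional[bool]  # None means the check could not be decided
    executed: bool         # True if the evidence came from running released code


def assign_label(checks: List[Check]) -> Label:
    """Aggregate per-task checks into one label (assumed rule, for illustration)."""
    decided = [c for c in checks if c.holds is not None]
    if not decided:
        return Label.INCONCLUSIVE
    passed = [c for c in decided if c.holds]
    if not passed:
        return Label.IN_CONFLICT
    if len(passed) < len(decided):
        return Label.PARTIALLY_SUPPORTED
    # Every decided check passed: distinguish executed evidence from paper-only evidence.
    if any(c.executed for c in passed):
        return Label.SUPPORTED
    return Label.SUPPORTED_BY_PAPER


# CompGCN case from the abstract: the cross-task performance claim.
checks = [
    Check("link prediction", holds=True, executed=True),
    Check("node classification", holds=True, executed=True),
    # Reproduced 88.4% does not exceed the strongest reported baseline, 92.6%.
    Check("graph classification (MUTAG)", holds=88.4 > 92.6, executed=True),
]
print(assign_label(checks).value)  # prints "Partially supported"
```

Under this assumed rule, a single failing task is enough to downgrade a cross-task claim from Supported to Partially supported, which matches the outcome the abstract reports for CompGCN.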
