Intelligence Is Not the Bottleneck: Validating an LLM First-Pass Manuscript Score Against Peer-Review Outcomes

Large language model (LLM) systems are increasingly proposed to assist peer review, yet most evaluations judge the prose of machine-generated review text, not the validity of the numeric score a system assigns. We validate AIPR, which reads a submitted manuscript and emits five 0-100 quality dimensions and a weighted overall score, against the public decision outcomes of a major machine learning venue. AIPR grades by prompting alone, with no fine-tuning on reviews or decisions. Across 300 ICLR submissions with public decision tiers and reviewer ratings, graded under a frozen pipeline with hypotheses pre-registered before any score met any outcome, the overall score separates rejected from accepted submissions (AUROC 0.82, 95% CI 0.78-0.87), rises monotonically across tiers, and tracks the mean reviewer rating. The signal is strongest where we claim it: the lowest-scoring fifth is rejected far above the base rate, with oral papers absent. The validity comes mostly from the model: a one-paragraph prompt on the same model discriminates almost as well as the full pipeline (the small gap favours the pipeline but does not meet the pre-declared criterion, p = 0.09). What the engineering adds is reliability and a grounded review: AIPR's score barely moves across repeated runs (0.7 vs. 2.8 points within-paper SD) where the bare prompt swings, and the same pass returns a rubric-structured, evidence-grounded review rather than a bare number, with the human keeping the decision.
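The two headline quantities above, discrimination (AUROC with a confidence interval) and run-to-run stability (within-paper SD), can be illustrated with a minimal sketch. This is not the authors' evaluation code: the percentile bootstrap, the definition of within-paper SD as the mean of per-paper sample standard deviations, and all function names and toy values are assumptions made for illustration.

```python
import random
import statistics


def auroc(scores, labels):
    """AUROC as the probability that a randomly chosen accepted paper
    (label 1) scores higher than a randomly chosen rejected paper
    (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one accepted and one rejected paper")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (assumed method; the
    source does not say how its 95% CI was computed)."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(auroc([scores[i] for i in idx], ys))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def within_paper_sd(runs_by_paper):
    """Mean over papers of the sample SD of repeated-run scores
    (assumed definition of 'within-paper SD')."""
    return statistics.mean(
        statistics.stdev(runs) for runs in runs_by_paper.values()
    )


# Toy, hypothetical data: overall scores and accept (1) / reject (0) labels.
toy_scores = [41, 48, 52, 55, 60, 63, 67, 72]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(auroc(toy_scores, toy_labels))
print(bootstrap_ci(toy_scores, toy_labels, n_boot=500))
print(within_paper_sd({"paper_a": [61.0, 61.5, 60.8], "paper_b": [44.2, 45.0, 44.9]}))
```

Under this reading, a low within-paper SD means that regrading the same manuscript yields nearly the same score, which is a separate property from how well the score separates accepted from rejected submissions.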
