Large language model (LLM) systems are increasingly proposed to assist peer review, yet most evaluations judge the prose of machine-generated review text, not the validity of the numeric score a system assigns. We validate AIPR, which reads a submitted manuscript and emits scores on five 0-100 quality dimensions plus a weighted overall score, against the public decision outcomes of a major machine learning venue. AIPR grades by prompting alone, with no fine-tuning on reviews or decisions. Across 300 ICLR submissions with public decision tiers and reviewer ratings, graded under a frozen pipeline with hypotheses pre-registered before any score met any outcome, the overall score separates rejected from accepted submissions (AUROC 0.82, 95% CI 0.78-0.87), rises monotonically across tiers, and tracks the mean reviewer rating. The signal is strongest where we claim it: the lowest-scoring fifth is rejected at a rate far above the base rate and contains no oral papers. The validity comes mostly from the model: a one-paragraph prompt on the same model discriminates almost as well as the full pipeline (the small gap favours the pipeline but does not meet the pre-declared criterion, p = 0.09). What the engineering adds is reliability and a grounded review: AIPR's score barely moves across repeated runs, whereas the bare prompt's score swings (within-paper SD 0.7 vs. 2.8 points), and the same pass returns a rubric-structured, evidence-grounded review rather than a bare number, with the human keeping the decision.
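For readers unfamiliar with the two headline statistics, the following is a minimal sketch of how an AUROC with a bootstrap confidence interval and a within-paper SD across repeated runs could be computed. It is not the authors' analysis code: the function names (`auroc`, `bootstrap_ci`, `within_paper_sd`), the percentile bootstrap, and the choice to average per-paper sample SDs are all assumptions made for illustration, and the data in the usage example are synthetic.

```python
import random
import statistics


def auroc(scores, labels):
    """Probability that a random accepted paper (label 1) outscores a
    random rejected paper (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one accepted and one rejected paper")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUROC (an assumed method;
    the abstract does not say how its 95% CI was obtained)."""
    auroc(scores, labels)  # fail early if only one class is present
    rng = random.Random(seed)
    idx = range(len(scores))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]
        ys = [labels[i] for i in sample]
        # Skip resamples that happen to contain a single class.
        if 0 < sum(ys) < len(ys):
            stats.append(auroc([scores[i] for i in sample], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def within_paper_sd(runs_per_paper):
    """Mean over papers of the sample SD of that paper's repeated scores
    (assumed definition of 'within-paper SD')."""
    return statistics.mean(statistics.stdev(runs) for runs in runs_per_paper)


if __name__ == "__main__":
    # Synthetic illustration only: accepted papers score higher on average.
    rng = random.Random(1)
    labels = [1] * 30 + [0] * 70
    scores = [rng.gauss(70 if y else 60, 8) for y in labels]
    print("AUROC:", round(auroc(scores, labels), 3))
    print("95% CI:", bootstrap_ci(scores, labels))
    print("within-paper SD:", within_paper_sd([[61, 62, 61], [70, 71, 70]]))
```

The pairwise AUROC is quadratic in the number of papers, which is unproblematic at a sample size of 300; a rank-based formulation would give the same value for larger sets.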