There are increasing indications that LLMs are used not only to produce scientific papers but also as part of the peer review process. In this work, we provide the first comprehensive analysis of LLM use across the peer review pipeline, with particular attention to interaction effects: not just whether LLM-assisted papers or LLM-assisted reviews differ in isolation, but whether LLM-assisted reviews evaluate LLM-assisted papers differently. Specifically, we analyze over 125,000 paper-review pairs from ICLR, NeurIPS, and ICML. We initially observe what appears to be a systematic interaction effect: LLM-assisted reviews seem especially kind to LLM-assisted papers compared to papers with minimal LLM use. However, controlling for paper quality reveals a different story: LLM-assisted reviews are simply more lenient toward lower-quality papers in general, and the over-representation of LLM-assisted papers among weaker submissions creates a spurious interaction effect rather than genuine preferential treatment of LLM-generated content. By augmenting our observational findings with reviews that are fully LLM-generated, we find that fully LLM-generated reviews exhibit severe rating compression that fails to discriminate between papers of different quality, while human reviewers using LLMs substantially reduce this leniency. Finally, examining metareviews, we find that LLM-assisted metareviews are more likely to render accept decisions than human metareviews given equivalent reviewer scores, though fully LLM-generated metareviews tend to be harsher. This suggests that meta-reviewers do not merely outsource decision-making to the LLM. These findings provide important input for developing policies that govern the use of LLMs during peer review, and they more generally indicate how LLMs interact with existing decision-making processes.
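The confounding argument above can be illustrated with a minimal toy sketch. It is not the authors' data or estimation procedure: the quality levels, the leniency bonus, the paper counts, and all function names are hypothetical values chosen only for illustration. In this toy world, review scores depend on paper quality and on whether the review is LLM-assisted, never on whether the paper is LLM-assisted, yet a naive comparison still shows an apparent interaction because LLM-assisted papers are concentrated among weak submissions.

```python
from fractions import Fraction

# Hypothetical toy setup: every number below is an illustrative assumption,
# not a value taken from the study.
LOW, HIGH = 3, 7          # latent paper quality on a rating-like scale
LENIENCY_BONUS = 2        # extra points an LLM-assisted review gives a LOW paper

# (paper_is_llm_assisted, quality) -> number of papers.
# LLM-assisted papers are over-represented among weak submissions.
PAPER_COUNTS = {
    (True, LOW): 8, (True, HIGH): 2,
    (False, LOW): 2, (False, HIGH): 8,
}

def score(quality, review_is_llm):
    """Rating depends on quality and review type only, never on paper type."""
    bonus = LENIENCY_BONUS if (review_is_llm and quality == LOW) else 0
    return quality + bonus

def build_pairs():
    """Give every paper one human review and one LLM-assisted review."""
    pairs = []
    for (paper_llm, quality), n in PAPER_COUNTS.items():
        for review_llm in (False, True):
            row = (paper_llm, review_llm, quality, score(quality, review_llm))
            pairs += [row] * n
    return pairs

def mean_score(pairs, paper_llm, review_llm, quality=None):
    """Exact mean score of one cell, optionally restricted to one quality level."""
    vals = [s for p, r, q, s in pairs
            if p == paper_llm and r == review_llm
            and (quality is None or q == quality)]
    return Fraction(sum(vals), len(vals))

def interaction(pairs, quality=None):
    """Difference-in-differences: LLM-review leniency on LLM-assisted papers
    minus LLM-review leniency on the other papers."""
    gap_llm_papers = (mean_score(pairs, True, True, quality)
                      - mean_score(pairs, True, False, quality))
    gap_other_papers = (mean_score(pairs, False, True, quality)
                        - mean_score(pairs, False, False, quality))
    return gap_llm_papers - gap_other_papers

pairs = build_pairs()
naive = interaction(pairs)                                    # ignores quality
stratified = {q: interaction(pairs, q) for q in (LOW, HIGH)}  # controls for quality

print("naive interaction:", float(naive))
print("within-quality interaction:", {q: float(v) for q, v in stratified.items()})
```

With these assumed numbers, the naive estimate is 1.2 rating points, while the estimate within each quality level is exactly 0. The apparent preference for LLM-assisted papers is produced entirely by the composition of the two paper groups.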