Peer review is at the heart of modern science. As submission numbers rise and research communities grow, a decline in review quality has become a popular narrative and a common concern. Yet, is it true? Review quality is difficult to measure, and the ongoing evolution of reviewing practices makes it hard to compare reviews across venues and over time. To address this, we introduce a new framework for evidence-based comparative study of review quality and apply it to major AI and machine learning conferences: ICLR, NeurIPS, and *ACL. We document the diversity of review formats and introduce a new approach to review standardization. We propose a multi-dimensional schema that quantifies review quality as utility to editors and authors, coupled with both LLM-based and lightweight measurements. We study the relationships among these measurements of review quality and how it evolves over time. Contradicting the popular narrative, our cross-temporal analysis reveals no consistent decline in median review quality across venues and years. We propose alternative explanations and outline recommendations to facilitate future empirical studies of review quality.