Is it possible to reliably evaluate the quality of peer reviews? We study this question driven by two primary motivations -- incentivizing high-quality reviewing using assessed quality of reviews, and measuring changes to review quality in experiments. We conduct a large-scale study at the NeurIPS 2022 conference, a top-tier conference in machine learning, in which we invited (meta-)reviewers and authors to evaluate reviews given to submitted papers. First, we conduct a randomized controlled trial (RCT) to examine bias due to the length of reviews. We generate elongated versions of reviews by adding substantial amounts of non-informative content. Participants in the control group evaluate the original reviews, whereas participants in the experimental group evaluate the artificially lengthened versions. We find that the lengthened reviews receive statistically significantly higher quality scores than the original reviews. In an analysis of observational data, we find that authors are positively biased towards reviews recommending acceptance of their own papers, even after controlling for the confounders of review length, quality, and different numbers of papers per author. We also measure disagreement rates of 28%-32% between multiple evaluations of the same review, which is comparable to the disagreement rates among paper reviewers at NeurIPS. Further, we assess the amount of miscalibration of evaluators of reviews using a linear model of quality scores and find that it is similar to estimates of miscalibration of paper reviewers at NeurIPS. Finally, we estimate the amount of variability in subjective opinions around how to map individual criteria to overall scores of review quality and find that it is roughly the same as that in the review of papers. Our results suggest that the various problems that exist in reviews of papers -- inconsistency, bias towards irrelevant factors, miscalibration, subjectivity -- also arise in the reviewing of reviews.
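As an illustration of how a length-bias comparison of this kind could be tested, the sketch below runs a one-sided permutation test on the difference in mean quality scores between a control group and a lengthened-review group. The abstract does not say which statistical test the study used, so the permutation test is an assumption chosen for simplicity. The function name `permutation_test` and the score lists are hypothetical, and the scores are synthetic placeholders, not data from the study.

```python
import random


def permutation_test(control, treatment, n_permutations=2000, seed=0):
    """One-sided permutation test for mean(treatment) > mean(control).

    Returns the observed difference in means and a Monte Carlo p-value.
    Illustrative sketch only; not the test reported in the paper.
    """
    rng = random.Random(seed)
    n_control = len(control)
    observed = sum(treatment) / len(treatment) - sum(control) / n_control

    pooled = list(control) + list(treatment)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Randomly reassign scores to the two groups under the null hypothesis
        # that group membership has no effect on the score.
        rng.shuffle(pooled)
        perm_control = pooled[:n_control]
        perm_treatment = pooled[n_control:]
        diff = sum(perm_treatment) / len(perm_treatment) - sum(perm_control) / n_control
        if diff >= observed:
            at_least_as_large += 1

    # The add-one correction keeps the estimated p-value strictly positive.
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic integer quality scores (hypothetical, for illustration only).
original_scores = [3, 4, 3, 5, 4, 3, 4, 2]
lengthened_scores = [4, 5, 4, 5, 5, 4, 5, 3]

diff, p = permutation_test(original_scores, lengthened_scores)
print(f"difference in means: {diff:.3f}, p-value: {p:.3f}")
```

A small p-value would indicate that a difference at least this large is unlikely if review length had no effect on the scores. With real data, the group sizes and the number of permutations would normally be much larger.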