With the growing number of problematic peer reviews at top AI conferences, the community urgently needs automatic quality control measures. In this paper, we restrict our attention to substantiation, a popular quality aspect indicating whether the claims in a review are sufficiently supported by evidence, and provide a solution that automates this evaluation process. To this end, we first formulate the problem as claim-evidence pair extraction in scientific peer reviews and collect SubstanReview, the first annotated dataset for this task. SubstanReview consists of 550 reviews from NLP conferences annotated by domain experts. Using this dataset, we train an argument mining system to automatically analyze the level of substantiation in peer reviews. We also analyze the SubstanReview dataset to obtain insights into peer reviewing quality at NLP conferences in recent years.