A Gold Standard Dataset for the Reviewer Assignment Problem

Many peer-review venues are using or looking to use algorithms to assign submissions to reviewers. The crux of such automated approaches is the notion of the "similarity score" (a numerical estimate of a reviewer's expertise in reviewing a paper), and many algorithms have been proposed to compute these scores. However, these algorithms have not been subjected to a principled comparison, making it difficult for stakeholders to choose an algorithm in an evidence-based manner. The key challenge in comparing existing algorithms and developing better ones is the lack of publicly available gold-standard data needed to perform reproducible research. We address this challenge by collecting a novel dataset of similarity scores that we release to the research community. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers who evaluated their expertise in reviewing papers they had previously read. We use this data to compare several popular algorithms employed in computer science conferences and offer recommendations for stakeholders. Our main findings are as follows. First, all algorithms make a non-trivial amount of error. For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases, highlighting the vital need for more research on the similarity-computation problem. Second, most existing algorithms are designed to work with titles and abstracts of papers, and in this regime the Specter+MFR algorithm performs best. Third, to improve performance, it may be important to develop modern deep-learning based algorithms that can make use of the full texts of papers: the classical TF-IDF algorithm enhanced with full texts of papers is on par with the deep-learning based Specter+MFR, which cannot make use of this information.
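
The pairwise ordering task mentioned in the abstract can be illustrated with a minimal sketch. It is a simplified illustration only and is not necessarily the exact metric used in the paper: the function name `pairwise_error_rate`, the half-error treatment of tied predictions, and the example scores are all assumptions made here. For one reviewer, it takes the self-reported expertise scores and an algorithm's similarity scores for the same papers, and reports the fraction of paper pairs that the algorithm orders incorrectly.

```python
from itertools import combinations


def pairwise_error_rate(expertise, predicted):
    """Fraction of paper pairs ordered incorrectly by the predicted scores.

    expertise: self-reported expertise scores for one reviewer's papers.
    predicted: algorithm-produced similarity scores for the same papers.
    Pairs with equal self-reported expertise are skipped; a tie in the
    predicted scores counts as half an error (an assumption of this sketch).
    """
    if len(expertise) != len(predicted):
        raise ValueError("score lists must have the same length")

    errors = 0.0
    total = 0
    for i, j in combinations(range(len(expertise)), 2):
        true_diff = expertise[i] - expertise[j]
        if true_diff == 0:
            continue  # no ground-truth ordering for this pair
        pred_diff = predicted[i] - predicted[j]
        total += 1
        if pred_diff == 0:
            errors += 0.5  # the algorithm cannot separate the two papers
        elif (true_diff > 0) != (pred_diff > 0):
            errors += 1.0  # the algorithm ranks the pair in the wrong order
    return errors / total if total else 0.0


# Hypothetical scores for one reviewer and three papers (not from the dataset).
example_expertise = [5, 3, 1]
example_predicted = [0.9, 0.2, 0.4]
example_rate = pairwise_error_rate(example_expertise, example_predicted)
```

In this made-up example, the algorithm orders the first paper correctly against the other two but swaps the second and third, so one of the three comparable pairs is misordered.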
