Automated peer review has evolved from simple text classification to structured feedback generation. However, current state-of-the-art systems still struggle with "surface-level" critiques: they excel at summarizing content but, because they evaluate papers in a vacuum without the external context a human expert possesses, often fail to accurately assess novelty and significance or to identify deep methodological flaws. In this paper, we introduce ScholarPeer, a search-enabled multi-agent framework designed to emulate the cognitive process of a senior researcher. ScholarPeer employs a dual-stream process of context acquisition and active verification: it dynamically constructs a domain narrative with a historian agent, identifies missing baseline comparisons with a baseline scout, and verifies claims through a multi-aspect Q&A engine, grounding its critique in live, web-scale literature. We evaluate ScholarPeer on DeepReview-13K; the results show that it achieves significantly higher win rates than state-of-the-art approaches in side-by-side evaluations and narrows the gap to human-level review diversity.