Peer review remains the central quality-control mechanism of science, yet its ability to fulfill this role is increasingly strained. Empirical studies document serious shortcomings: long publication delays, escalating reviewer burden concentrated on a small minority of scholars, inconsistent quality and low inter-reviewer agreement, and systematic biases by gender, language, and institutional prestige. Decades of human-centered reforms have yielded only marginal improvements. Meanwhile, artificial intelligence, especially large language models (LLMs), is being piloted across the peer-review pipeline by journals, funders, and individual reviewers. Early studies suggest that AI assistance can produce reviews comparable in quality to those written by humans, accelerate reviewer selection and feedback, and reduce certain biases, but it also raises distinctive concerns about hallucination, confidentiality, gaming, novelty recognition, and loss of trust. In this paper, we map the aims and persistent failure modes of peer review to specific LLM applications and systematically analyze the objections they raise alongside safeguards that could make their use acceptable. Drawing on emerging evidence, we show that targeted, supervised LLM assistance can plausibly improve error detection and timeliness and reduce reviewer workload without displacing human judgment. We highlight advanced architectures, including fine-tuned, retrieval-augmented, and multi-agent systems, that may enable more reliable, auditable, and interdisciplinary review. We argue that ethical and practical considerations are not peripheral but constitutive: the legitimacy of AI-assisted peer review depends on governance choices as much as on technical capacity. The path forward is neither uncritical adoption nor reflexive rejection, but carefully scoped pilots with explicit evaluation metrics, transparency, and accountability.