Assessing originality in AI research is arguably the most consequential yet least reliable step in peer review. Reviewer judgments of originality remain opaque, inconsistent, and dependent on comparisons to prior work that are often incomplete. In this paper, we present a large-scale, data-driven qualitative and quantitative analysis of research originality based on over 100,000 peer-review reports from leading AI venues, spanning a period of rapid growth in the field. Leveraging structured, semantically retrieved prior work and signals embedded in expert reviewer assessments, we systematically characterize how originality is perceived in practice and identify the key dimensions that most strongly influence novelty judgments. Our analysis yields a fine-grained, evidence-based framework that equips both authors and reviewers with actionable insights into how originality is evaluated. In addition, we evaluate the reliability of current large language model (LLM) agents in assessing originality. We find that these models tend to systematically overestimate novelty and struggle to detect conceptual plagiarism, particularly in the presence of paraphrasing. We release our dataset, trained models, and code at: https://anonymous.4open.science/r/Novelty-Reviewer-365C/.