NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment

Novelty is a core requirement in academic publishing and a central focus of peer review, yet the growing volume of submissions has placed increasing pressure on human reviewers. While large language models (LLMs), including those fine-tuned on peer review data, have shown promise in generating review comments, the absence of a dedicated benchmark has limited systematic evaluation of their ability to assess research novelty. To address this gap, we introduce NovBench, the first large-scale benchmark designed to evaluate LLMs' ability to generate novelty evaluations in support of human peer review. NovBench comprises 1,684 paper-review pairs from a leading NLP conference; each pair combines a novelty description extracted from the paper's introduction with the corresponding expert-written novelty evaluation. We use both sources because the introduction provides a standardized and explicit articulation of novelty claims, while expert-written novelty evaluations constitute one of the current gold standards of human judgment. Furthermore, we propose a four-dimensional evaluation framework (Relevance, Correctness, Coverage, and Clarity) to assess the quality of LLM-generated novelty evaluations. Extensive experiments on both general and specialized LLMs under different prompting strategies reveal that current models exhibit limited understanding of scientific novelty, and that fine-tuned models often suffer from instruction-following deficiencies. These findings underscore the need for targeted fine-tuning strategies that jointly improve novelty comprehension and instruction adherence.
