The rigorous evaluation of the novelty of a scientific paper is, even for human scientists, a challenging task. With the increasing interest in AI scientists and AI involvement in scientific idea generation and paper writing, it also becomes increasingly important that this task be automatable and reliable, lest both human attention and compute tokens be wasted on ideas that have already been explored. Due to the challenge of quantifying ground-truth novelty, however, existing novelty metrics for scientific papers generally validate their results against noisy, confounded signals such as citation counts or peer review scores. These proxies can conflate novelty with impact, quality, or reviewer preference, which in turn makes it harder to assess how well a given metric actually evaluates novelty. We therefore propose an axiomatic benchmark for scientific novelty metrics. We first define a set of axioms that a well-behaved novelty metric should satisfy, grounded in human scientific norms and practice, then evaluate existing metrics across ten tasks spanning three domains of AI research. Our results reveal that no existing metric satisfies all axioms consistently, and that metrics fail on systematically different axioms, reflecting their underlying architectures. Additionally, we show that combining metrics of complementary architectures leads to consistent improvements on the benchmark, with per-axiom weighting achieving 90.1% versus 71.5% for the best individual metric, suggesting that developing architecturally diverse metrics is a promising direction for future work. We release the benchmark code as supplementary material to encourage the development of more robust scientific literature novelty metrics.