Recent studies have evaluated the creativity and novelty of large language models (LLMs) primarily from a semantic perspective, using benchmarks from cognitive science. However, assessing the novelty of scholarly publications remains a largely unexplored dimension of LLM evaluation. In this paper, we introduce SchNovel, a scholarly novelty benchmark for evaluating LLMs' ability to assess novelty in scholarly papers. SchNovel consists of 15,000 pairs of papers across six fields, sampled from the arXiv dataset, with publication dates 2 to 10 years apart; in each pair, the more recently published paper is assumed to be more novel. Additionally, we propose RAG-Novelty, which simulates the human review process by retrieving similar papers to assess novelty. Extensive experiments provide insights into the capabilities of different LLMs to assess novelty and demonstrate that RAG-Novelty outperforms recent baseline models.