With the rapid development of NLP, large language models (LLMs) now excel at a wide range of tasks across multiple domains. However, existing benchmarks may not adequately measure these models' capabilities, especially when they are faced with new knowledge. In this paper, we address the lack of benchmarks for evaluating LLMs' ability to handle new knowledge, an important and challenging capability in a rapidly evolving world. We propose an approach called KnowGen that generates new knowledge by altering the attributes and relationships of existing entities, resulting in artificial entities that are distinct from real-world entities. With KnowGen, we introduce a benchmark named ALCUNA to assess LLMs' abilities in knowledge understanding, differentiation, and association. We benchmark several LLMs and find that their performance in the face of new knowledge is not satisfactory, particularly when reasoning between new knowledge and internal knowledge. We also explore the impact of entity similarity on a model's understanding of entity knowledge, as well as the influence of contextual entities. We urge caution when using LLMs in new scenarios or with new knowledge, and hope that our benchmark can help drive the development of LLMs in the face of new knowledge.
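To make the idea of "altering existing entity attributes and relationships" concrete, the following is a minimal illustrative sketch and not the KnowGen procedure itself. The `Entity` class, the `make_artificial_entity` function, the `swap_fraction` parameter, and the toy data are all hypothetical names and values introduced here for illustration only; the actual method may select and alter properties quite differently.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical container: a name plus attribute and relation maps."""
    name: str
    attributes: dict = field(default_factory=dict)  # attribute -> value
    relations: dict = field(default_factory=dict)   # relation -> target entity name


def make_artificial_entity(parent, donor, new_name, rng, swap_fraction=0.5):
    """Copy `parent`, then overwrite some of its attributes with `donor` values.

    Assumption for this sketch: an artificial entity is a real entity whose
    properties are partly replaced by those of another real entity.
    """
    attrs = dict(parent.attributes)  # copy so the parent stays untouched
    # Attributes both entities have, but with different values.
    shared = sorted(
        k for k in attrs
        if k in donor.attributes and donor.attributes[k] != attrs[k]
    )
    n_swap = max(1, round(len(shared) * swap_fraction)) if shared else 0
    for key in rng.sample(shared, n_swap):
        attrs[key] = donor.attributes[key]

    # Keep the parent's relations and add donor relations the parent lacks.
    rels = dict(parent.relations)
    rels.update({k: v for k, v in donor.relations.items() if k not in rels})
    return Entity(new_name, attrs, rels)


# Toy data, invented purely for illustration.
parent = Entity(
    "Panthera leo",
    {"diet": "carnivore", "habitat": "savanna", "mass_kg": 190},
    {"genus": "Panthera"},
)
donor = Entity(
    "Ursus arctos",
    {"diet": "omnivore", "habitat": "forest", "mass_kg": 300},
    {"genus": "Ursus", "hibernates_in": "den"},
)

artificial = make_artificial_entity(
    parent, donor, "Panthera fictus", random.Random(0)
)
changed = [k for k in parent.attributes
           if artificial.attributes[k] != parent.attributes[k]]
print(artificial.name, "differs from its parent in:", sorted(changed))
```

The resulting entity stays close to a real one yet does not correspond to anything in the model's training data, which is the property that makes questions about it a test of handling new knowledge rather than recall.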