The unprecedented performance of large language models (LLMs) necessitates improvements in evaluation. We believe that, rather than merely exploring the breadth of LLM abilities, meticulous and thoughtful design is essential to thorough, unbiased, and applicable evaluation. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering $19$ tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system, comprising overall standard scores for better numerical comparability across tasks and models, and a unique self-contrast metric for automatically evaluating knowledge hallucination. We evaluate $21$ open-source and commercial LLMs and obtain some intriguing findings. The KoLA dataset and open-participation leaderboard are publicly released at https://kola.xlore.cn and will be continuously updated to provide references for developing LLMs and knowledge-related systems.
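The abstract does not spell out how the overall standard scores are computed. As a minimal sketch, assuming a plain z-score taken over the models evaluated on each task, the idea of making scores comparable across tasks with different raw scales might look like the following. The function name `standard_scores`, the model names, and all raw numbers are hypothetical and are not taken from KoLA.

```python
from statistics import mean, pstdev


def standard_scores(raw_scores):
    """Map each model's raw score on one task to a z-score.

    Assumption: "standard score" is read here as (x - mean) / std over
    the models evaluated on that task; KoLA's exact formula may differ.
    """
    values = list(raw_scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All models tie on this task, so no model stands out.
        return {model: 0.0 for model in raw_scores}
    return {model: (score - mu) / sigma for model, score in raw_scores.items()}


# Hypothetical raw scores on two tasks whose metrics use different scales.
task_a = {"model_x": 0.62, "model_y": 0.48, "model_z": 0.55}  # e.g. an F1 in [0, 1]
task_b = {"model_x": 31.0, "model_y": 45.0, "model_z": 38.0}  # e.g. a 0-100 metric

z_a = standard_scores(task_a)
z_b = standard_scores(task_b)

# After standardization the two tasks share a scale and can be averaged.
overall = {model: (z_a[model] + z_b[model]) / 2 for model in task_a}
```

Averaging raw scores here would let the 0-100 metric dominate; averaging standardized scores weights both tasks equally, which is the kind of cross-task comparability the abstract describes.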