The unprecedented performance of large language models (LLMs) necessitates improvements in evaluation. Rather than merely exploring the breadth of LLM abilities, we believe that meticulous and thoughtful design is essential to thorough, unbiased, and applicable evaluation. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For ability modeling, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering $19$ tasks. (2) For data, to ensure fair comparisons, we use both Wikipedia, a corpus on which LLMs are commonly pre-trained, and continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For evaluation criteria, we adopt a contrastive system comprising overall standard scores, which improve numerical comparability across tasks and models, and a unique self-contrast metric for automatically evaluating knowledge hallucination. We evaluate $21$ open-source and commercial LLMs and obtain some intriguing findings. The KoLA dataset and open-participation leaderboard are publicly released at https://kola.xlore.cn and will be continuously updated to provide references for developing LLMs and knowledge-related systems.
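To illustrate why standardized scores make results comparable across tasks with different raw metrics, the following is a minimal sketch that assumes a conventional z-score computed per task across models. The abstract does not state the exact formula, so this is an assumption rather than the benchmark's actual procedure; the function name `standard_scores`, the model names, and the raw values are all hypothetical.

```python
from statistics import mean, pstdev


def standard_scores(raw_scores):
    """Convert raw per-model scores on ONE task into z-scores across models.

    Assumption: a plain z-score, (x - mean) / std, using the population
    standard deviation. The benchmark's real formula may differ.
    """
    values = list(raw_scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All models scored the same, so no model stands out on this task.
        return {model: 0.0 for model in raw_scores}
    return {model: (score - mu) / sigma for model, score in raw_scores.items()}


# Hypothetical raw scores for illustration only, not benchmark results.
task_scores = {"model_a": 40.0, "model_b": 50.0, "model_c": 60.0}
z = standard_scores(task_scores)
```

Because each task is centered and scaled independently, a model's scores on tasks with very different raw ranges (for example, accuracy versus a generation metric) can be averaged or ranked on a common scale.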