KoLA: Carefully Benchmarking World Knowledge of Large Language Models

Jifan Yu, Xiaozhi Wang, Shangqing Tu, Shulin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, Chunyang Li, Zheyuan Zhang, Yushi Bai, Yantao Liu, Amy Xin, Nianyi Lin, Kaifeng Yun, Linlu Gong, Jianhui Chen, Zhili Wu, Yunjia Qi, Weikai Li, Yong Guan, Kaisheng Zeng, Ji Qi, Hailong Jin, Jinxin Liu, Yu Gu, Yuan Yao, Ning Ding, Lei Hou, Zhiyuan Liu, Bin Xu, Jie Tang, Juanzi Li

From arXiv; accepted at ICLR 2024

The unprecedented performance of large language models (LLMs) necessitates improvements in evaluation. Rather than merely exploring the breadth of LLM abilities, we believe that meticulous and thoughtful designs are essential to thorough, unbiased, and applicable evaluations. Given the importance of world knowledge to LLMs, we construct a Knowledge-oriented LLM Assessment benchmark (KoLA), in which we carefully design three crucial factors: (1) For \textbf{ability modeling}, we mimic human cognition to form a four-level taxonomy of knowledge-related abilities, covering $19$ tasks. (2) For \textbf{data}, to ensure fair comparisons, we use both Wikipedia, a corpus on which LLMs are prevalently pre-trained, and continuously collected emerging corpora, aiming to evaluate the capacity to handle unseen data and evolving knowledge. (3) For \textbf{evaluation criteria}, we adopt a contrastive system, including overall standard scores for better numerical comparability across tasks and models, and a unique self-contrast metric for automatically evaluating knowledge-creating ability. We evaluate $28$ open-source and commercial LLMs and obtain some intriguing findings. The KoLA dataset and open-participation leaderboard are publicly released at https://kola.xlore.cn and will be continuously updated to provide references for developing LLMs and knowledge-related systems.
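The abstract does not give the formula behind the "overall standard scores", so the following is only a minimal sketch of one common way to standardize: a per-task z-score computed across models. The function name `standard_scores`, the model names, and all numbers are hypothetical, and this should not be read as the benchmark's actual procedure.

```python
from statistics import mean, pstdev


def standard_scores(raw_scores):
    """Map each model's raw score on ONE task to a z-score across models.

    Assumption: z = (x - mean) / std, using the population standard
    deviation. The abstract does not specify the exact formula.
    """
    values = list(raw_scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every model scored the same, so no model stands out on this task.
        return {model: 0.0 for model in raw_scores}
    return {model: (score - mu) / sigma for model, score in raw_scores.items()}


# Hypothetical raw scores on two tasks whose metrics use different scales.
task_a = {"model_x": 40.0, "model_y": 50.0, "model_z": 60.0}  # e.g. a 0-100 metric
task_b = {"model_x": 0.2, "model_y": 0.5, "model_z": 0.8}     # e.g. a 0-1 metric

z_a = standard_scores(task_a)
z_b = standard_scores(task_b)
```

Under this assumption, each model ends up with the same relative standing on both tasks despite the different raw scales, which is the kind of cross-task comparability the abstract describes.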
