Large language models (LLMs) are primarily evaluated by their overall performance on various text understanding and generation tasks. However, this paradigm fails to comprehensively differentiate fine-grained linguistic and cognitive skills, offering little interpretation of LLMs' capabilities. In this paper, we present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation. Specifically, we formulate LLM evaluation in a multi-dimensional and explainable manner by dissociating language-related capabilities from cognition-related ones. Moreover, by extracting intermediate reasoning from LLMs, we break down the application of a specific capability into three sub-steps: recalling relevant knowledge, utilizing knowledge, and solving problems. Finally, FAC$^2$E evaluates each sub-step of each fine-grained capability, providing a two-faceted diagnosis for LLMs. Utilizing FAC$^2$E, we identify a common shortfall in knowledge utilization among models and propose a straightforward, knowledge-enhanced method to mitigate this issue. Our results not only showcase promising performance enhancements but also highlight a direction for future LLM advancements.