With the growing popularity of general-purpose Large Language Models (LLMs) comes a need for more global explanations of model behaviors. Concept-based explanations have emerged as a promising avenue for explaining the high-level patterns learned by LLMs, yet their evaluation poses unique challenges, especially due to their non-local nature and high-dimensional representation in a model's hidden space. Current methods approach concepts from different perspectives and lack a unified formalization, which makes evaluating the core properties of concepts, namely faithfulness and readability, challenging. To bridge this gap, we introduce a formal definition of concepts that generalizes across diverse concept-based explanation settings. Based on this definition, we quantify the faithfulness of a concept explanation via perturbation, ensuring adequate perturbation in the high-dimensional space for different concepts by solving an optimization problem. Readability is approximated with an automatic and deterministic measure that quantifies the coherence of the patterns maximally activating a concept while aligning with human understanding. Finally, drawing on measurement theory, we apply a meta-evaluation method to assess these measures, one that also generalizes to other types of explanations and tasks. Extensive experimental analysis informs the selection of explanation evaluation measures.
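As a rough illustration of the kind of perturbation-based probe the abstract alludes to, the sketch below perturbs toy hidden states along a candidate concept direction and compares the resulting output change against a random-direction baseline. This is a minimal sketch under strong assumptions, not the paper's faithfulness measure or its optimization-based perturbation; all names (`hidden_states`, `concept_direction`, `readout`) are hypothetical stand-ins for real LLM activations and an actual output head.

```python
# Illustrative only: a crude perturbation-based faithfulness proxy for a
# concept direction in hidden space. Not the paper's method.
import numpy as np

rng = np.random.default_rng(0)

d_hidden, n_samples = 64, 100
hidden_states = rng.normal(size=(n_samples, d_hidden))   # toy hidden activations
concept_direction = rng.normal(size=d_hidden)
concept_direction /= np.linalg.norm(concept_direction)   # unit concept vector
readout = rng.normal(size=d_hidden)                       # toy stand-in for an output head

def output_score(h):
    """Toy scalar model output for a batch of hidden states."""
    return h @ readout

def perturb_along(h, direction, epsilon):
    """Shift hidden states along a direction by magnitude epsilon."""
    return h + epsilon * direction

# Compare how much perturbing the concept direction changes the output,
# relative to perturbing a random direction of the same magnitude.
epsilon = 1.0
delta_concept = output_score(perturb_along(hidden_states, concept_direction, epsilon)) \
                - output_score(hidden_states)
random_direction = rng.normal(size=d_hidden)
random_direction /= np.linalg.norm(random_direction)
delta_random = output_score(perturb_along(hidden_states, random_direction, epsilon)) \
               - output_score(hidden_states)

faithfulness_proxy = np.abs(delta_concept).mean() / (np.abs(delta_random).mean() + 1e-8)
print(f"faithfulness proxy (concept vs. random perturbation): {faithfulness_proxy:.3f}")
```

In a real setting, the hidden states would come from an LLM's intermediate layers and the output change would be measured on the model's actual predictions; the paper additionally formulates the choice of perturbation magnitude as an optimization problem, which this sketch does not attempt.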