Despite the surprisingly high intelligence exhibited by Large Language Models (LLMs), their black-box nature makes us hesitant to fully deploy them in real-life applications. Concept-based explanations have emerged as a promising avenue for explaining what LLMs have learned, making them more transparent to humans. However, current evaluations of concepts tend to be heuristic and non-deterministic, relying on, e.g., case studies or human evaluation, which hinders the development of the field. To bridge this gap, we approach the evaluation of concept-based explanations via faithfulness and readability. We first introduce a formal definition of a concept that generalizes to diverse concept-based explanations. Based on this definition, we quantify faithfulness via the difference in the output upon perturbation. We then provide an automatic measure of readability, obtained by measuring the coherence of the patterns that maximally activate a concept. This measure serves as a cost-effective and reliable substitute for human evaluation. Finally, based on measurement theory, we describe a meta-evaluation method that assesses the above measures via reliability and validity and that can be generalized to other tasks as well. We conduct extensive experimental analysis to validate the concept evaluation measures and inform their selection.
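To make the idea of quantifying faithfulness "via the difference in the output upon perturbation" concrete, the following is a minimal sketch, not the measure defined in the paper. It rests on several assumptions that the abstract does not state: a concept is modelled as a direction in hidden space, the perturbation is ablation of that direction by projection, the model output is a toy linear readout followed by a softmax, and the output difference is taken as total variation distance. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ablate_concept(hidden, direction):
    """Perturbation (assumed): remove the component of `hidden` along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    activation = sum(h * u for h, u in zip(hidden, unit))
    return [h - activation * u for h, u in zip(hidden, unit)]


def readout(hidden, weights):
    """Toy output head (assumed): one weight row per class, then softmax."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return softmax(logits)


def faithfulness(hidden, direction, weights):
    """Output difference (assumed: total variation distance) before vs. after ablation."""
    before = readout(hidden, weights)
    after = readout(ablate_concept(hidden, direction), weights)
    return 0.5 * sum(abs(b - a) for b, a in zip(before, after))


if __name__ == "__main__":
    identity_head = [[1.0, 0.0], [0.0, 1.0]]
    # Ablating a direction the hidden state relies on changes the output.
    print(faithfulness([2.0, 1.0], [1.0, 0.0], identity_head))
    # Ablating a direction with zero activation leaves the output unchanged.
    print(faithfulness([0.0, 1.0], [1.0, 0.0], identity_head))
```

Under these assumptions, a concept whose removal leaves the output unchanged scores zero, while a concept the output depends on scores higher. Other perturbations (for example, adding noise along the direction) or other output distances could be substituted without changing the overall structure.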