Despite the surprisingly high intelligence exhibited by Large Language Models (LLMs), their black-box nature makes us hesitant to deploy them fully in real-life applications. Concept-based explanations have emerged as a promising avenue for explaining what LLMs have learned, making them more transparent to humans. However, current evaluations of concepts tend to be heuristic and non-deterministic, e.g., case studies or human evaluation, hindering the development of the field. To bridge this gap, we approach concept-based explanation evaluation via faithfulness and readability. We first introduce a formal definition of concept that generalizes to diverse concept-based explanations. Based on this, we quantify faithfulness via the difference in the output upon perturbation. We then provide an automatic measure of readability, which quantifies the coherence of the patterns that maximally activate a concept. This measure serves as a cost-effective and reliable substitute for human evaluation. Finally, drawing on measurement theory, we describe a meta-evaluation method that assesses the above measures via reliability and validity and can be generalized to other tasks. Extensive experimental analysis has been conducted to validate and inform the selection of concept evaluation measures.
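To make the two measures concrete, below is a minimal sketch, not the paper's implementation, under simplifying assumptions: a concept is taken to be a direction in a hidden-activation space, faithfulness is approximated by the change in output after ablating that direction (one possible perturbation), and readability by the mean pairwise similarity of the inputs that maximally activate the concept. All function and variable names (`ablate_concept`, `faithfulness`, `readability`, `model_head`) are hypothetical.

```python
import numpy as np

def ablate_concept(activations: np.ndarray, concept: np.ndarray) -> np.ndarray:
    """Remove the component of each activation along the concept direction."""
    c = concept / np.linalg.norm(concept)
    return activations - np.outer(activations @ c, c)

def faithfulness(model_head, activations: np.ndarray, concept: np.ndarray) -> float:
    """Difference in output upon perturbation: a larger change suggests the
    concept is more faithful to the model's computation."""
    p_orig = model_head(activations)                        # outputs from original activations
    p_pert = model_head(ablate_concept(activations, concept))
    return float(np.mean(np.abs(p_orig - p_pert)))          # mean absolute output change

def readability(activations: np.ndarray, concept: np.ndarray, top_k: int = 20) -> float:
    """Coherence of the top-k inputs that maximally activate the concept,
    measured as mean pairwise cosine similarity."""
    scores = activations @ concept
    top = activations[np.argsort(-scores)[:top_k]]
    top = top / np.linalg.norm(top, axis=1, keepdims=True)
    sim = top @ top.T                                        # pairwise cosine similarities
    n = top.shape[0]
    return float((sim.sum() - n) / (n * (n - 1)))            # mean off-diagonal similarity
```

In this sketch, a concept that barely changes the output when ablated would score low on faithfulness, while a concept whose top-activating examples are mutually dissimilar would score low on readability; the paper's actual perturbation scheme and coherence measure may differ.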