Concept-based models aim to improve interpretability by predicting high-level intermediate concepts, which makes them a promising approach for deployment in high-risk scenarios. However, they are known to suffer from information leakage, whereby models exploit unintended information encoded within the learned concepts. We introduce an information-theoretic framework to rigorously characterise and quantify leakage, and define two complementary measures: the concepts-task leakage (CTL) and interconcept leakage (ICL) scores. We show that these measures are strongly predictive of model behaviour under interventions and outperform existing alternatives. Using this framework, we identify the primary causes of leakage and, as a case study, analyse how it manifests in Concept Embedding Models, revealing interconcept and alignment leakage in addition to the concepts-task leakage present by design. Finally, we present a set of practical guidelines for designing concept-based models to reduce leakage and ensure interpretability.