Concept-based models (CMs) are deep neural networks that ground their predictions in representations aligned with human-understandable concepts (e.g., "round" or "stripes"). Such models have been shown to learn representations that leak concept-irrelevant information. As the traditional narrative goes, this leakage is undesirable and should be eradicated because it leads to uninterpretable models. In this paper, we posit that this conventional view of leakage in CMs is not only ill-posed, since the evidence that leakage makes a model less interpretable is often inconclusive, but also bound to produce impractical CMs under common real-world constraints. Specifically, we argue that in real-world settings, where concept incompleteness is the norm, some leakage is often necessary for constructing accurate and intervenable CMs. To this end, we propose that there is such a thing as benign leakage, and we show that by optimizing a reformulation of the typical CM training objective, CMs can encourage and exploit this form of leakage without sacrificing accuracy or intervenability.