Interpretability is in the Mind of the Beholder: A Causal Framework for Human-interpretable Representation Learning

The focus in Explainable AI is shifting from explanations defined in terms of low-level elements, such as input features, to explanations encoded in terms of interpretable concepts learned from data. How to reliably acquire such concepts is, however, still fundamentally unclear. An agreed-upon notion of concept interpretability is missing, with the result that concepts used by both post-hoc explainers and concept-based neural networks are acquired through a variety of mutually incompatible strategies. Critically, most of these neglect the human side of the problem: a representation is understandable only insofar as it can be understood by the human at the receiving end. The key challenge in Human-interpretable Representation Learning (HRL) is how to model and operationalize this human element. In this work, we propose a mathematical framework for acquiring interpretable representations suitable for both post-hoc explainers and concept-based neural networks. Our formalization of HRL builds on recent advances in causal representation learning and explicitly models a human stakeholder as an external observer. This allows us to derive a principled notion of alignment between the machine representation and the vocabulary of concepts understood by the human. In doing so, we link alignment and interpretability through a simple and intuitive name transfer game, and clarify the relationship between alignment and a well-known property of representations, namely disentanglement. We also show that alignment is linked to the issue of undesirable correlations among concepts, also known as concept leakage, and to content-style separation, all through a general information-theoretic reformulation of these properties. Our conceptualization aims to bridge the gap between the human and algorithmic sides of interpretability and establish a stepping stone for new research on human-interpretable representations.
