This paper argues that interpretability research in Artificial Intelligence (AI) is fundamentally ill-posed, as existing definitions of interpretability fail to specify how interpretability can be formally tested or designed for. We posit that actionable definitions of interpretability must be formulated in terms of *symmetries* that inform model design and lead to testable conditions. Under a probabilistic view, we hypothesise that four symmetries (inference equivariance, information invariance, concept-closure invariance, and structural invariance) suffice to (i) formalise interpretable models as a subclass of probabilistic models, (ii) yield a unified formulation of interpretable inference (e.g., alignment, interventions, and counterfactuals) as a form of Bayesian inversion, and (iii) provide a formal framework for verifying compliance with safety standards and regulations.
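To make the intended shape of such conditions concrete, here is a minimal sketch with notation introduced purely for illustration (the symbols $f$, $g$, $x$, $c$, $X$, $C$ are not taken from the paper): writing $f$ for a model's inference map, $g$ for a transformation acting on inputs and outputs, $X$ for the input variable, and $C$ for a concept variable, two of the hypothesised symmetries and the Bayesian-inversion view of interpretable inference could be stated as equations a model either satisfies or violates:

```latex
% Illustrative only: f, g, x, c, X, C are hypothetical symbols,
% not the paper's notation. Each symmetry is phrased as an
% equality, i.e., a condition that can be formally tested.
\begin{align}
  f(g \cdot x) &= g \cdot f(x)
    && \text{(inference equivariance)} \\
  I(g \cdot X;\, C) &= I(X;\, C)
    && \text{(information invariance)} \\
  p(c \mid x) &\propto p(x \mid c)\, p(c)
    && \text{(interpretable inference as Bayesian inversion)}
\end{align}
```

Under this reading, each symmetry narrows the class of admissible probabilistic models, and compliance checking amounts to verifying whether a given model satisfies the corresponding equalities.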