This paper argues that interpretability research in Artificial Intelligence (AI) is fundamentally ill-posed, because existing definitions of interpretability fail to describe how interpretability can be formally tested or designed for. We posit that actionable definitions of interpretability must be formulated in terms of *symmetries* that inform model design and lead to testable conditions. Under a probabilistic view, we hypothesise that four symmetries (inference equivariance, information invariance, concept-closure invariance, and structural invariance) suffice to (i) formalise interpretable models as a subclass of probabilistic models, (ii) yield a unified formulation of interpretable inference (e.g., alignment, interventions, and counterfactuals) as a form of Bayesian inversion, and (iii) provide a formal framework to verify compliance with safety standards and regulations.