This paper argues that interpretability research in Artificial Intelligence is fundamentally ill-posed, because existing definitions of interpretability are not *actionable*: they fail to provide formal principles from which concrete modelling and inferential rules can be derived. We posit that for a definition of interpretability to be actionable, it must be given in terms of *symmetries*. We hypothesise that four symmetries suffice to (i) motivate core interpretability properties, (ii) characterise the class of interpretable models, and (iii) derive a unified formulation of interpretable inference (e.g., alignment, interventions, and counterfactuals) as a form of Bayesian inversion.