The most common methods in explainable artificial intelligence are post-hoc techniques that identify the most relevant features used by pretrained opaque models. Some of the most advanced post-hoc methods can generate explanations that account for the mutual interactions of input features in the form of logic rules. However, these methods frequently fail to guarantee that the extracted explanations are consistent with the model's underlying reasoning. To bridge this gap, we propose a theoretically grounded approach that ensures the coherence and fidelity of the extracted explanations, moving beyond the limitations of current heuristic-based approaches. To this end, drawing on category theory, we introduce an explaining functor that structurally preserves logical entailment between the explanation and the opaque model's reasoning. As a proof of concept, we validate the proposed theoretical construction on a synthetic benchmark, verifying that the proposed approach significantly mitigates the generation of contradictory or unfaithful explanations.
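To make the functorial intuition concrete, the following sketch (an illustrative formalization under assumed notation, not necessarily the paper's exact construction) views both the opaque model's reasoning $\mathcal{M}$ and the explanation language $\mathcal{X}$ as thin categories whose morphisms are entailments; functoriality then forces entailment to be preserved by the explanation map.
\[
  \mathcal{M}:\ \alpha \to \beta \iff \alpha \vDash_{\mathcal{M}} \beta,
  \qquad
  \mathcal{X}:\ \varphi \to \psi \iff \varphi \vDash_{\mathcal{X}} \psi .
\]
An explaining functor $E \colon \mathcal{M} \to \mathcal{X}$ must send every morphism $\alpha \to \beta$ to a morphism $E(\alpha) \to E(\beta)$, i.e.
\[
  \alpha \vDash_{\mathcal{M}} \beta \ \Longrightarrow\ E(\alpha) \vDash_{\mathcal{X}} E(\beta),
\]
so, under this reading, an extracted explanation cannot contradict an entailment to which the model's reasoning is already committed.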