Model interpretability plays a central role in human-AI decision-making systems. Ideally, explanations should be expressed using human-interpretable semantic concepts. Moreover, the explainer should capture the causal relations between these concepts to allow reasoning about the explanations. Lastly, explanation methods should be efficient and should not compromise the performance of the predictive task. Despite rapid advances in AI explainability in recent years, to the best of our knowledge no existing method fulfills all three properties. Indeed, mainstream methods for local concept explainability do not produce causal explanations and incur a trade-off between explainability and predictive performance. We present DiConStruct, an explanation method that is both concept-based and causal, with the goal of creating more interpretable local explanations in the form of structural causal models and concept attributions. Our explainer works as a distillation model for any black-box machine learning model, approximating its predictions while producing the respective explanations. As a result, DiConStruct generates explanations efficiently without affecting the black-box prediction task. We validate our method on an image dataset and a tabular dataset, showing that DiConStruct approximates the black-box models with higher fidelity than other concept explainability baselines, while providing explanations that include the causal relations between the concepts.