Evaluating the Ability of Explanations to Disambiguate Models in a Rashomon Set

Source: arXiv. This is a preprint of the paper published at the MURE workshop, AAAI 2026. It builds on separate work published at FAccT 2025 (preprint: arXiv:2505.10399).

Explainable artificial intelligence (XAI) is concerned with producing explanations that indicate the inner workings of models. For a Rashomon set of similarly performing models, explanations provide a way to disambiguate the behavior of individual models, which helps in selecting models for deployment. However, explanations themselves can vary depending on the explainer used, so they need to be evaluated. In the paper "Evaluating Model Explanations without Ground Truth", we proposed three principles of explanation evaluation and a new method, "AXE", to evaluate the quality of feature-importance explanations. Here, we illustrate how evaluation metrics that rely on comparing model explanations against ideal ground-truth explanations obscure behavioral differences within a Rashomon set. Explanation evaluation aligned with our proposed principles would instead highlight these differences, helping select models from the Rashomon set. Alternate models selected from the Rashomon set can maintain identical predictions yet mislead explainers into generating false explanations, and mislead evaluation methods into rating those false explanations as high quality. AXE, our proposed explanation evaluation method, can detect this adversarial fairwashing of explanations with a 100% success rate. Unlike prior explanation evaluation strategies, such as those based on model sensitivity or ground-truth comparison, AXE can determine when protected attributes are used to make predictions.
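To make the idea of evaluating an explanation without a ground-truth explanation more concrete, the following is a minimal sketch and not the AXE implementation. It assumes a toy model that predicts from a protected attribute only, and scores a feature-importance explanation by how well a k-nearest-neighbor vote over the explanation's top-ranked features recovers the model's own predictions. The function name `predictiveness`, the synthetic data, and the choice of leave-one-out k-NN are all assumptions made for illustration.

```python
import numpy as np


def predictiveness(X, model_preds, importances, top_n=1, k=5):
    """Leave-one-out k-NN accuracy at recovering model_preds from the
    top_n features ranked highest by the explanation (hypothetical helper)."""
    # Keep only the features the explanation claims matter most.
    top = np.argsort(-np.abs(importances))[:top_n]
    Z = X[:, top]
    # Pairwise squared distances in the reduced feature space.
    d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a point may not vote for itself
    nn = np.argsort(d, axis=1)[:, :k]
    # Majority vote of the neighbors' model predictions (k is odd, so no ties).
    votes = model_preds[nn].mean(axis=1)
    return float(((votes > 0.5).astype(int) == model_preds).mean())


# Synthetic data (assumption): column 0 is a protected attribute,
# column 1 is an unrelated feature.
rng = np.random.default_rng(0)
n = 400
protected = rng.integers(0, 2, size=n).astype(float)
other = rng.normal(size=n)
X = np.column_stack([protected, other])

# Toy model (assumption): predictions depend on the protected attribute only.
model_preds = protected.astype(int)

# Two candidate explanations of the same predictions.
honest_importances = np.array([1.0, 0.0])      # credits the protected attribute
fairwashed_importances = np.array([0.0, 1.0])  # hides it behind the other feature

honest_score = predictiveness(X, model_preds, honest_importances)
fairwashed_score = predictiveness(X, model_preds, fairwashed_importances)
print("honest:", honest_score)
print("fairwashed:", fairwashed_score)
```

In this setup the honest explanation recovers the model's predictions perfectly, while the fairwashed explanation should score close to chance, because the feature it credits carries no information about the predictions. No reference explanation is consulted at any point, which is the property the sketch is meant to illustrate.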
