Disagreement amongst counterfactual explanations: How transparency can be deceptive

Counterfactual explanations are increasingly used as an Explainable Artificial Intelligence (XAI) technique to provide stakeholders of complex machine learning algorithms with explanations for data-driven decisions. Their popularity has led to a boom in algorithms that generate them. However, different algorithms do not necessarily produce the same explanation for the same instance. Although multiple possible explanations are beneficial in some contexts, there are circumstances in which diversity amongst counterfactual explanations creates a potential disagreement problem among stakeholders. Ethical issues arise when, for example, malicious agents use this diversity to fairwash an unfair machine learning model by hiding sensitive features. As legislators worldwide begin to include the right to explanations for data-driven, high-stakes decisions in their policies, these ethical issues should be understood and addressed. Our literature review on the disagreement problem in XAI reveals that this problem has never been empirically assessed for counterfactual explanations. Therefore, in this work, we conduct a large-scale empirical analysis on 40 datasets, using 12 explanation-generating methods for two black-box models, yielding over 192.0000 explanations. Our study finds alarmingly high levels of disagreement between the methods tested. When multiple counterfactual explanations are available, a malicious user is able to both exclude and include desired features. This disagreement seems to be driven mainly by the dataset characteristics and the type of counterfactual algorithm. XAI centers on the transparency of algorithmic decision-making, but our analysis advocates for transparency about this self-proclaimed transparency.
