Feature attribution methods have become a staple tool for disentangling the complex behavior of black-box models. Despite their success, some scholars have argued that such methods suffer from a serious flaw: they do not allow a reliable interpretation in terms of human concepts. Simply put, visualizing an array of feature contributions is not enough for humans to draw conclusions about a model's internal representations, and confirmation bias can trick users into false beliefs about model behavior. We argue that a structured approach is required to test whether our hypotheses about the model are confirmed by the feature attributions. This is what we call the "semantic match" between human concepts and (sub-symbolic) explanations. Building on the conceptual framework put forward in Cinà et al. [2023], we propose a structured approach to evaluate semantic match in practice. We showcase the procedure in a suite of experiments spanning tabular and image data, and show how the assessment of semantic match can give insight into both desirable model behaviors (e.g., focusing on an object relevant for prediction) and undesirable ones (e.g., focusing on a spurious correlation). We couple our experimental results with an analysis of the metrics used to measure semantic match, and argue that this approach constitutes the first step towards resolving the issue of confirmation bias in XAI.