Feature attribution methods have become a staple for disentangling the complex behavior of black-box models. Despite their success, some scholars have argued that such methods suffer from a serious flaw: they do not allow a reliable interpretation in terms of human concepts. Simply put, visualizing an array of feature contributions is not enough for humans to conclude something about a model's internal representations, and confirmation bias can trick users into false beliefs about model behavior. We argue that a structured approach is required to test whether our hypotheses about the model are confirmed by the feature attributions. This is what we call the "semantic match" between human concepts and (sub-symbolic) explanations. Building on the conceptual framework put forward in Cinà et al. [2023], we propose a structured approach to evaluate semantic match in practice. We showcase the procedure in a suite of experiments spanning tabular and image data, and show how the assessment of semantic match can give insight into both desirable (e.g., focusing on an object relevant for prediction) and undesirable model behaviors (e.g., focusing on a spurious correlation). We couple our experimental results with an analysis of metrics for measuring semantic match, and argue that this approach constitutes the first step towards resolving the issue of confirmation bias in XAI.
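To make the idea concrete, here is a minimal sketch (not the paper's actual procedure) of one way a semantic-match check could be operationalized for image data: a human hypothesis is encoded as a binary concept mask, and the score below measures how much of the attribution mass falls inside that region. The function name and the mass-based metric are illustrative assumptions, not the metrics analyzed in the paper.

```python
import numpy as np

def semantic_match_score(attribution: np.ndarray, concept_mask: np.ndarray) -> float:
    """Fraction of positive attribution mass inside the region a human
    annotated as expressing the concept (hypothetical metric for illustration).

    attribution  : 2-D saliency map from any feature-attribution method.
    concept_mask : binary 2-D mask, 1 where the concept is hypothesized to be.
    """
    pos = np.clip(attribution, 0, None)   # keep positive evidence only
    total = pos.sum()
    if total == 0:
        return 0.0
    return float(pos[concept_mask.astype(bool)].sum() / total)

# Toy usage: an attribution map concentrated on the top-left quadrant,
# matched against the hypothesis that the concept lives exactly there.
rng = np.random.default_rng(0)
attr = rng.random((8, 8))
attr[:4, :4] += 5.0                       # concentrated attribution
mask = np.zeros((8, 8))
mask[:4, :4] = 1                          # human hypothesis about the concept region
print(f"semantic match: {semantic_match_score(attr, mask):.2f}")
```

A score near 1 would support the hypothesis, while a low score on the same mask would flag a mismatch, e.g., attribution concentrated on a spurious background correlation instead of the hypothesized object.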