Feature attribution methods have become a staple for disentangling the complex behavior of black-box models. Despite their success, some scholars have argued that such methods suffer from a serious flaw: they do not allow a reliable interpretation in terms of human concepts. Simply put, visualizing an array of feature contributions is not enough for humans to conclude something about a model's internal representations, and confirmation bias can trick users into false beliefs about model behavior. We argue that a structured approach is required to test whether our hypotheses about the model are confirmed by the feature attributions. This is what we call the "semantic match" between human concepts and (sub-symbolic) explanations. Building on the conceptual framework put forward in Cinà et al. [2023], we propose a structured approach to evaluate semantic match in practice. We showcase the procedure in a suite of experiments spanning tabular and image data, and show how the assessment of semantic match can give insight into both desirable (e.g., focusing on an object relevant for prediction) and undesirable model behaviors (e.g., focusing on a spurious correlation). We couple our experimental results with an analysis of the metrics used to measure semantic match, and argue that this approach constitutes the first step towards resolving the issue of confirmation bias in XAI.
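To make the idea of semantic match concrete, the following is a minimal, hypothetical sketch of how one might score the agreement between a human concept and a feature attribution map. It assumes the concept is expressed as a binary mask over the input features; the function names and the two example metrics (attribution mass inside the concept region, and rank agreement via Spearman correlation) are illustrative choices, not the paper's exact procedure or metrics.

```python
# Hypothetical sketch: scoring "semantic match" between a human concept
# (a binary mask over input features) and a feature attribution map.
# The metric choices here are illustrative, not the paper's method.
import numpy as np
from scipy.stats import spearmanr


def mass_inside_concept(attribution: np.ndarray, concept_mask: np.ndarray) -> float:
    """Fraction of absolute attribution mass falling inside the concept region."""
    attr = np.abs(attribution)
    total = attr.sum()
    return float(attr[concept_mask.astype(bool)].sum() / total) if total > 0 else 0.0


def rank_agreement(attribution: np.ndarray, concept_mask: np.ndarray) -> float:
    """Spearman correlation between attribution values and the concept indicator."""
    rho, _ = spearmanr(attribution.ravel(), concept_mask.ravel())
    return float(rho)


# Toy example: a 4x4 attribution map whose mass concentrates on the concept.
attribution = np.zeros((4, 4))
attribution[:2, :2] = 1.0                  # model "attends" to the top-left patch
concept_mask = np.zeros((4, 4), dtype=int)
concept_mask[:2, :2] = 1                   # the human concept lives in the same patch

print(mass_inside_concept(attribution, concept_mask))  # 1.0 -> strong semantic match
print(rank_agreement(attribution, concept_mask))       # 1.0 -> rankings agree
```

High scores would support the hypothesis that the model relies on the concept (e.g., an object relevant for prediction); low scores on a concept encoding a spurious correlate would flag undesirable behavior, rather than leaving the judgment to eyeballing the attribution map.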