Explainable AI has become a popular tool for validating machine learning models. Mismatches between the explained model's decision strategy and the user's domain knowledge (e.g., Clever Hans effects) have also been recognized as a starting point for improving faulty models. However, it is less clear what to do when the user and the explanation agree. In this paper, we demonstrate that the user's acceptance of explanations does not guarantee that a machine learning model functions well; in particular, some Clever Hans effects may remain undetected. Such hidden flaws can nevertheless be mitigated, and we demonstrate this by contributing a new method, Explanation-Guided Exposure Minimization (EGEM), that preemptively prunes variations in the ML model that have not been the subject of positive explanation feedback. Experiments on natural image data demonstrate that our approach leads to models that strongly reduce their reliance on hidden Clever Hans strategies and consequently achieve higher accuracy on new data.