Explainable AI has become a popular tool for validating machine learning models. Mismatches between the explained model's decision strategy and the user's domain knowledge (e.g. Clever Hans effects) have also been recognized as a starting point for improving faulty models. However, it is less clear what to do when the user and the explanation agree. In this paper, we demonstrate that acceptance of explanations by the user does not guarantee that an ML model functions well; in particular, some Clever Hans effects may remain undetected. Such hidden flaws of the model can nevertheless be mitigated, and we demonstrate this by contributing a new method, Explanation-Guided Exposure Minimization (EGEM), that preemptively prunes variations in the ML model that have not been the subject of positive explanation feedback. Experiments on natural image data demonstrate that our approach leads to models that strongly reduce their reliance on hidden Clever Hans strategies and consequently achieve higher accuracy on new data.