Robustness has become an important consideration in deep learning. With the help of explainable AI, mismatches between an explained model's decision strategy and the user's domain knowledge (e.g. Clever Hans effects) have been identified as a starting point for improving faulty models. However, it is less clear what to do when the user and the explanation agree. In this paper, we demonstrate that a user's acceptance of explanations does not guarantee that a machine learning model is robust against Clever Hans effects, which may remain undetected. Such hidden flaws can nevertheless be mitigated, and we demonstrate this by contributing a new method, Explanation-Guided Exposure Minimization (EGEM), that preemptively prunes variations in the ML model that have not been the subject of positive explanation feedback. Experiments demonstrate that our approach leads to models that strongly reduce their reliance on hidden Clever Hans strategies and consequently achieve higher accuracy on new data.