Explainable artificial intelligence (XAI) methods are portrayed as a remedy for debugging and trusting statistical and deep learning models, as well as for interpreting their predictions. However, recent advances in adversarial machine learning (AdvML) highlight the limitations and vulnerabilities of state-of-the-art explanation methods, calling their security and trustworthiness into question. The possibility of manipulating, fooling, or fairwashing evidence of a model's reasoning has detrimental consequences in high-stakes decision-making and knowledge discovery. This survey provides a comprehensive overview of research on adversarial attacks on explanations of machine learning models and on fairness metrics. We introduce a unified notation and a taxonomy of methods, establishing common ground for researchers and practitioners from the intersecting fields of AdvML and XAI. We discuss how to defend against attacks and how to design robust interpretation methods. We contribute a list of existing insecurities in XAI and outline emerging research directions in adversarial XAI (AdvXAI). Future work should improve explanation methods and evaluation protocols to account for the reported safety issues.