Reliable deployment of machine learning models such as neural networks remains challenging. Among the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out-of-distribution inputs. In this exploratory review, we examine the possibilities and limits of adversarial attacks on explainable machine learning models. First, we extend the notion of adversarial examples to explainable machine learning scenarios, in which the inputs, the output classifications, and the explanations of the model's decisions are all assessed by humans. Next, we propose a comprehensive framework for studying whether (and how) adversarial examples can be generated for explainable models under human assessment, introducing and illustrating novel attack paradigms. In particular, our framework considers a wide range of relevant yet often ignored factors, such as the type of problem, the user's expertise, or the objective of the explanations, in order to identify the attack strategies that should be adopted in each scenario to successfully deceive the model (and the human). These contributions are intended to serve as a basis for a more rigorous and realistic study of adversarial examples in the field of explainable machine learning.