Adversarial attacks on explainability models have drastic consequences when explanations are used to understand the reasoning of neural networks in safety-critical systems. Path methods are one class of attribution methods susceptible to such attacks. Adversarial learning is typically framed as a constrained optimisation problem. In this work, we propose algebraic adversarial examples and study the conditions under which adversarial examples can be generated for integrated gradients. Algebraic adversarial examples provide a mathematically tractable approach to constructing adversarial examples.
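For context on the attribution method being attacked, the following is a minimal sketch of integrated gradients on a toy differentiable function, approximating the path integral with a Riemann sum. The function `f` and its analytic gradient are illustrative stand-ins (a real model would use autodiff); this is not the paper's attack construction, only the baseline attribution it targets.

```python
import numpy as np

# Toy differentiable "model" output: f(x) = sum of squares.
# Stand-in for a network's scalar output; chosen so the gradient is analytic.
def f(x):
    return np.sum(x ** 2)

def grad_f(x):
    # Analytic gradient of f; a real model would compute this via autodiff.
    return 2 * x

def integrated_gradients(x, baseline, steps=200):
    # IG_i(x) = (x_i - b_i) * integral over alpha in [0,1] of
    # dF/dx_i evaluated at b + alpha*(x - b), approximated by a
    # midpoint-rule Riemann sum along the straight-line path.
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros_like(x)
attr = integrated_gradients(x, baseline)
# Completeness axiom: attributions sum to f(x) - f(baseline).
print(attr, attr.sum(), f(x) - f(baseline))
```

The completeness axiom (attributions summing to the output difference) is what makes path methods attractive for auditing safety-critical models, and it is precisely such structural properties that an adversarial perturbation of the explanation must work around.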