Explainable Artificial Intelligence (XAI) methods play a crucial role in making neural networks more understandable and trustworthy. Nonetheless, these methods can produce misleading explanations. Blinding attacks can drastically alter a machine learning model's prediction and explanation: visually imperceptible artifacts added to the input yield misleading information, while the model's accuracy is maintained. This poses a serious challenge to the reliability of XAI methods. To address this challenge, we use statistical analysis to highlight how the weights of a CNN change following blinding attacks. We introduce a method specifically designed to limit the effectiveness of such attacks during the evaluation phase, avoiding the need for extra training. The proposed method defends against most modern explanation-aware adversarial attacks, achieving a ~99\% decrease in the Attack Success Rate (ASR) and a ~91\% reduction in the Mean Square Error (MSE) between the original explanation and the defended (post-attack) explanation across three distinct types of attacks.