Deep learning has been rapidly adopted across many applications and is revolutionizing many industries, but it is known to be vulnerable to adversarial attacks. Such attacks pose a serious threat to deep learning-based systems, compromising their integrity, reliability, and trustworthiness. Interpretable Deep Learning Systems (IDLSes) are designed to make such systems more transparent and explainable, but they have also been shown to be susceptible to attacks. In this work, we propose a novel microbial genetic algorithm-based black-box attack against IDLSes that requires no prior knowledge of the target model or its interpretation model. The proposed attack is query-efficient and combines transfer-based and score-based methods, making it a powerful tool for uncovering IDLS vulnerabilities. Our experiments show that the attack achieves high success rates using adversarial examples whose attribution maps are highly similar to those of benign samples, which makes the attack difficult to detect even for human analysts. Our results highlight the need for improved IDLS security to ensure their practical reliability.
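To make the search strategy named in the abstract concrete, the following is a minimal sketch of a generic microbial genetic algorithm driven only by black-box scores. It is not the authors' implementation: the function `microbial_ga`, its parameters, and the toy objective `toy_score` are hypothetical names introduced for illustration. In an actual attack the fitness would presumably combine the target model's output score with a measure of attribution-map similarity, and the initial population could be seeded from a surrogate model (the transfer-based part); neither is shown here.

```python
import random

def microbial_ga(fitness, dim, pop_size=20, tournaments=2000,
                 crossover_rate=0.5, mutation_rate=0.1,
                 mutation_scale=0.05, bound=0.1, seed=0):
    """Maximise `fitness` over perturbations in [-bound, bound]^dim.

    Each tournament costs exactly one fitness query (the re-scored loser),
    which is what makes the microbial variant cheap in queries.
    """
    rng = random.Random(seed)
    # Random initial population (assumption: no surrogate-model seeding).
    pop = [[rng.uniform(-bound, bound) for _ in range(dim)]
           for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]
    queries = pop_size

    for _ in range(tournaments):
        # Pick two distinct individuals and compare their scores.
        a, b = rng.sample(range(pop_size), 2)
        winner, loser = (a, b) if scores[a] >= scores[b] else (b, a)
        for i in range(dim):
            # "Infection": the loser copies some genes from the winner.
            if rng.random() < crossover_rate:
                pop[loser][i] = pop[winner][i]
            # Mutation: small Gaussian noise, clipped to the budget.
            if rng.random() < mutation_rate:
                v = pop[loser][i] + rng.gauss(0.0, mutation_scale)
                pop[loser][i] = max(-bound, min(bound, v))
        # Only the modified loser is re-evaluated; the winner is untouched.
        scores[loser] = fitness(pop[loser])
        queries += 1

    best = max(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best], queries


# Toy stand-in for a black-box score: closeness to a hidden target vector.
TARGET = [0.05, -0.03, 0.08, 0.0]

def toy_score(delta):
    return -sum((d - t) ** 2 for d, t in zip(delta, TARGET))


if __name__ == "__main__":
    best, score, queries = microbial_ga(toy_score, dim=len(TARGET))
    print(best, score, queries)
```

Because the winner of each tournament is never modified, the best score in the population cannot decrease, so the scheme is elitist without any extra bookkeeping.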