Attacks on machine learning models have been extensively studied through stateless optimization. In this paper, we demonstrate how a reinforcement learning (RL) agent can learn a new class of attack algorithms that generate adversarial samples. Unlike traditional adversarial machine learning (AML) methods, which craft each adversarial sample independently, our RL-based approach retains and exploits past attack experience to improve the effectiveness and efficiency of future attacks. We formulate adversarial sample generation as a Markov Decision Process and evaluate RL's ability to (a) learn effective and efficient attack strategies and (b) compete with state-of-the-art AML methods. On two image classification benchmarks, from the start to the end of training, our agent increases the attack success rate by up to 13.2% and reduces the average number of victim-model queries per attack by up to 16.9%. In a head-to-head comparison with state-of-the-art image attacks, our trained agent achieves a 17% higher success rate when generating adversarial samples for unseen inputs. From a security perspective, this work demonstrates a powerful new attack vector: using RL to train agents that attack ML models efficiently and at scale.
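The MDP framing can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the victim (`LinearVictim`), the agent (`PerturbationAgent`), the `attack` helper, and all weights and parameters are hypothetical. Each attack is one episode. The state is the current perturbed input, an action nudges one pixel up or down within an L-infinity bound, the reward is the reduction in the victim's margin for the true label, and the episode ends on misclassification or when the query budget runs out. For brevity the agent keeps state-independent action values (a bandit-style simplification of the MDP) and explores through optimistic initial values. Because those values persist across attacks, later attacks reuse what earlier ones learned. This is the contrast with stateless attacks that the abstract draws.

```python
# Hypothetical sketch: adversarial-sample generation as an episodic decision
# process with a learning agent whose experience persists across attacks.


class LinearVictim:
    """Toy black-box victim (hypothetical): a fixed linear binary classifier."""

    def __init__(self, weights):
        self.weights = weights
        self.queries = 0  # every margin() call counts as one victim query

    def margin(self, x, label):
        """Signed margin for the true label; <= 0 means misclassified."""
        self.queries += 1
        score = sum(w * xi for w, xi in zip(self.weights, x))
        return score if label == 1 else -score


class PerturbationAgent:
    """Action-value learner over (pixel, direction) actions.

    Simplification (assumed): values ignore the state, so the MDP is treated
    like a bandit. Optimistic initial values make the agent try every action once.
    """

    def __init__(self, n_pixels, optimistic_value=1.0):
        self.actions = [(i, d) for i in range(n_pixels) for d in (-1, 1)]
        self.values = {a: optimistic_value for a in self.actions}
        self.counts = {a: 0 for a in self.actions}

    def act(self):
        # Greedy choice; max() returns the earliest action among ties.
        return max(self.actions, key=lambda a: self.values[a])

    def update(self, action, reward):
        # Incremental sample mean; the first update replaces the optimistic value.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


def attack(victim, agent, x, label, step=0.25, bound=0.5, budget=40):
    """Run one attack episode and return (success, queries_used)."""
    start = victim.queries
    original = list(x)
    x = list(x)
    m = victim.margin(x, label)
    while m > 0 and victim.queries - start < budget:
        i, d = agent.act()
        # Move one pixel, clipped to an L-infinity ball around the clean input.
        x[i] = max(original[i] - bound, min(original[i] + bound, x[i] + d * step))
        new_m = victim.margin(x, label)
        agent.update((i, d), m - new_m)  # reward = reduction in true-class margin
        m = new_m
    return m <= 0, victim.queries - start


weights = [0.1] * 16
weights[3], weights[10] = 2.0, -1.5
victim = LinearVictim(weights)
agent = PerturbationAgent(n_pixels=16)
results = [attack(victim, agent, [0.5] * 16, label=1) for _ in range(5)]
print(results)  # [(True, 35), (True, 3), (True, 3), (True, 3), (True, 3)]
```

With these toy weights, the first attack spends most of its query budget trying every action once, while later attacks go straight to the most damaging pixel, so queries per attack fall from 35 to 3. These numbers come only from this contrived setup and say nothing about the results reported in the paper.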