Adversarial examples can be useful for identifying vulnerabilities in AI systems before they are deployed. In reinforcement learning (RL), adversarial policies can be developed by training an adversarial agent to minimize a target agent's rewards. Prior work has studied black-box versions of these attacks, where the adversary only observes the world state and treats the target agent as any other part of the environment. However, this approach ignores additional structure in the problem, such as the target agent's internal state. In this work, we study white-box adversarial policies and show that having access to a target agent's internal state can be useful for identifying its vulnerabilities. We make two contributions. (1) We introduce white-box adversarial policies where an attacker observes both a target's internal state and the world state at each timestep. We formulate ways of using these policies to attack agents in 2-player games and text-generating language models. (2) We demonstrate that these policies can achieve higher initial and asymptotic performance against a target agent than black-box controls. Code is available at https://github.com/thestephencasper/lm_white_box_attacks
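To make the black-box versus white-box distinction concrete, the following is a minimal sketch and not the implementation from the linked repository. It assumes that states are flat lists of floats, that the target's internal state is some vector of hidden activations, and that the adversary's reward is simply the negated target reward. The names `adversary_observation` and `adversary_reward` are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence


def adversary_observation(
    world_state: Sequence[float],
    target_internal_state: Optional[Sequence[float]] = None,
) -> List[float]:
    """Build the adversary's observation for one timestep.

    Black-box: only the world state is visible.
    White-box: the target's internal state (assumed here to be a
    vector of hidden activations) is appended to the world state.
    """
    obs = list(world_state)
    if target_internal_state is not None:
        obs.extend(target_internal_state)
    return obs


def adversary_reward(target_reward: float) -> float:
    """Assumed zero-sum setup: the adversary gains when the target loses."""
    return -target_reward


# Hypothetical values for a single timestep.
world = [0.5, -1.0]
latents = [0.1, 0.2, 0.3]

black_box_obs = adversary_observation(world)           # world state only
white_box_obs = adversary_observation(world, latents)  # world state + internals
```

Under these assumptions, the only difference between the two settings is the input to the adversary's policy; the training objective is the same, which is what makes the black-box policy a natural control.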