In reinforcement learning (RL), adversarial policies can be developed by training an adversarial agent to minimize a target agent's rewards. Prior work has studied black-box versions of these attacks, in which the adversary observes only the world state and treats the target agent as any other part of the environment. However, this setting ignores additional structure in the problem. In this work, we take inspiration from the literature on white-box attacks to train more effective adversarial policies, and we show that access to a target agent's internal state can be useful for identifying its vulnerabilities. We make two contributions. (1) We introduce white-box adversarial policies, in which an attacker observes both a target's internal state and the world state at each timestep, and we formulate ways of using these policies to attack agents in 2-player games and text-generating language models. (2) We demonstrate that these policies can achieve higher initial and asymptotic performance against a target agent than black-box controls. Code is available at https://github.com/thestephencasper/lm_white_box_attacks
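The difference between the black-box and white-box settings can be illustrated by what the attacker receives as its observation at each timestep. The following is a minimal sketch under stated assumptions, not the implementation from the linked repository: the target is taken to be a small one-hidden-layer network, its "internal state" is taken to be its hidden activations, and every name here (`target_forward`, `black_box_observation`, `white_box_observation`, the dimensions) is hypothetical.

```python
import math
import random


def target_forward(world_state, w_hidden, w_out):
    """Hypothetical target policy: one tanh hidden layer, then linear logits.

    Returns the action logits together with the hidden activations, which
    stand in for the target's "internal state" in this sketch.
    """
    hidden = [
        math.tanh(sum(w * s for w, s in zip(row, world_state)))
        for row in w_hidden
    ]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return logits, hidden


def black_box_observation(world_state, target_hidden):
    """Black-box attacker: sees only the world state."""
    return list(world_state)


def white_box_observation(world_state, target_hidden):
    """White-box attacker: sees the world state plus the target's internals."""
    return list(world_state) + list(target_hidden)


# Illustrative sizes and random weights (assumptions, not from the paper).
random.seed(0)
state_dim, hidden_dim, n_actions = 4, 8, 3
w_hidden = [[random.gauss(0, 1) for _ in range(state_dim)] for _ in range(hidden_dim)]
w_out = [[random.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(n_actions)]

world_state = [random.gauss(0, 1) for _ in range(state_dim)]
_, hidden = target_forward(world_state, w_hidden, w_out)

bb_obs = black_box_observation(world_state, hidden)
wb_obs = white_box_observation(world_state, hidden)
```

In this sketch the adversary's policy and training loop are unchanged between the two settings; only its input grows from `state_dim` to `state_dim + hidden_dim` entries. Which internal quantities to expose for a real target (activations, logits, value estimates) is a design choice this example does not settle.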