For Reinforcement Learning (RL) to be useful in real systems, RL agents must be robust to noise and adversarial attacks. In adversarial RL, an external attacker can manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which includes (i) state attacks, (ii) observation attacks (a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show that the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP). We call it a meta-MDP because it is not the true environment but a higher-level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity using standard RL techniques. We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, so such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios.
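To make the meta-MDP idea concrete, the following is a minimal sketch for the action-attack case only. It is not the construction from the paper: it assumes a small tabular environment, a fixed Markovian victim policy, and a hypothetical per-step `attack_cost` standing in for stealthiness. The names `build_action_attack_meta_mdp` and `value_iteration` are illustrative. The attacker sees the pair (state, victim's intended action) as its meta-state, picks the action that is actually executed, and is rewarded with the negated victim reward minus the cost of any override. Because the result is an ordinary MDP, standard planning applies.

```python
import numpy as np

def build_action_attack_meta_mdp(P, R, pi, attack_cost):
    """Build the attacker's MDP (an illustrative construction, not the paper's).

    P: (S, A, S) environment transitions, R: (S, A) victim rewards,
    pi: (S, A) fixed victim policy, attack_cost: assumed penalty per override.
    Meta-state index m = s * A + a, where a is the victim's intended action.
    """
    S, A, _ = P.shape
    n_meta = S * A
    P_meta = np.zeros((n_meta, A, n_meta))
    R_meta = np.zeros((n_meta, A))
    for s in range(S):
        for a in range(A):
            m = s * A + a
            for b in range(A):  # b is the action the attacker lets through
                # Next meta-state (s2, a2) has probability P[s, b, s2] * pi[s2, a2].
                P_meta[m, b] = (P[s, b][:, None] * pi).reshape(n_meta)
                # Attacker gains what the victim loses, minus the override cost.
                R_meta[m, b] = -R[s, b] - attack_cost * (b != a)
    return P_meta, R_meta

def value_iteration(P, R, gamma=0.9, tol=1e-10):
    """Standard value iteration on a tabular MDP."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)  # shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Toy environment (assumed for illustration): action a moves to state a,
# action 0 pays the victim 1, action 1 pays 0, and the victim always plays 0.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
R = np.array([[1.0, 0.0], [1.0, 0.0]])
pi = np.array([[1.0, 0.0], [1.0, 0.0]])

P_meta, R_meta = build_action_attack_meta_mdp(P, R, pi, attack_cost=0.5)
V, attack_policy = value_iteration(P_meta, R_meta)
# With this cost, overriding to action 1 is optimal in every meta-state.
print(attack_policy)
```

The meta-MDP here has S * A states and A actions, so its size is polynomial in the size of the original environment, which is the property that makes planning tractable in this simplified setting. State, observation, and reward attacks would need different meta-state and action definitions that this sketch does not cover.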