For Reinforcement Learning (RL) to be useful in real systems, it is crucial that agents be robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which includes (i) state attacks, (ii) observation attacks (a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show that the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP, since it is not the true environment but a higher-level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time, or by learning with polynomial sample complexity using standard RL techniques. We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, so such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios.
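To make the meta-MDP idea concrete, the following is a minimal sketch (all names and the toy environment are illustrative assumptions, not from the paper) of a reward attack: the victim runs a fixed policy in a tiny MDP, while the attacker perturbs the reward signal within a stealth budget. From the attacker's perspective, the attacked interaction, environment state plus the victim's responses, is itself the higher-level MDP it plans in.

```python
# Illustrative sketch (assumed names, not the paper's construction):
# a bounded reward-poisoning attack on a fixed victim policy. The
# attacker's decision problem over this attacked interaction is the
# "meta-MDP" described in the abstract.

class TwoStateEnv:
    """Toy 2-state MDP: action 1 taken in state 0 yields reward +1, else 0."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        reward = 1.0 if (self.state == 0 and action == 1) else 0.0
        self.state = 1 - self.state          # deterministic transition
        return self.state, reward

def victim_policy(obs):
    """Fixed victim policy: always play action 1."""
    return 1

def attacker_perturb(reward, budget=0.5):
    """Attacker's meta-MDP action: poison the reward by at most `budget`
    (the stealth constraint |delta| <= budget)."""
    return reward - budget                   # push the perceived reward down

env = TwoStateEnv()
obs, total_true, total_seen = 0, 0.0, 0.0
for _ in range(4):
    a = victim_policy(obs)                   # victim acts on its observation
    obs, r = env.step(a)
    total_true += r                          # victim's true return
    total_seen += attacker_perturb(r)        # poisoned signal the victim sees

print(total_true, total_seen)                # true return vs. poisoned return
```

Here a fixed perturbation already drives the victim's perceived return to zero; in the paper's setting the attacker instead optimizes such perturbations, state-, observation-, action-, or reward-level, by planning or learning in the meta-MDP.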