Among the most insidious attacks on Reinforcement Learning (RL) solutions are training-time attacks (TTAs) that create loopholes and backdoors in the learned behaviour. TTAs are no longer limited to simple disruption: constructive TTAs (C-TTAs) are now available, in which the attacker forces a specific target behaviour upon an RL agent in training (the victim). However, even state-of-the-art C-TTAs focus on target behaviours that the victim could naturally adopt if not for a particular feature of the environment dynamics, which the C-TTAs exploit. In this work, we show that a C-TTA is possible even when the target behaviour is unadoptable due to both the environment dynamics and its non-optimality with respect to the victim's objective(s). To find efficient attacks in this context, we develop a specialised flavour of the DDPG algorithm, which we term gammaDDPG, that learns this stronger version of C-TTA. gammaDDPG dynamically alters the planning horizon of the attack policy based on the victim's current behaviour. This improves the distribution of effort throughout the attack timeline and reduces the effect of the attacker's uncertainty about the victim. To demonstrate the features of our method and better relate the results to prior research, we borrow a 3D grid domain from a state-of-the-art C-TTA for our experiments. Code is available at "bit.ly/github-rb-gDDPG".
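To make the idea of a behaviour-dependent planning horizon concrete, the following is a minimal illustrative sketch and not the gammaDDPG implementation. It assumes that the horizon is controlled through the discount factor (a discount of gamma corresponds to an effective horizon of roughly 1 / (1 - gamma)) and that the victim's current behaviour is summarised by a hypothetical scalar `victim_distance` in [0, 1] measuring how far the victim is from the target behaviour. The function names, the linear schedule, its direction (farther victim, longer horizon) and the bounds `gamma_min` and `gamma_max` are all assumptions made for illustration.

```python
import numpy as np


def effective_horizon(gamma):
    """Effective planning horizon 1 / (1 - gamma) of a discounted objective."""
    return 1.0 / (1.0 - gamma)


def dynamic_gamma(victim_distance, gamma_min=0.5, gamma_max=0.99):
    """Hypothetical schedule mapping victim behaviour to a discount factor.

    victim_distance is an assumed scalar in [0, 1]: 0 means the victim already
    follows the target behaviour, 1 means it is as far from it as possible.
    The linear form and its direction are illustrative assumptions only.
    """
    d = float(np.clip(victim_distance, 0.0, 1.0))
    return gamma_min + (gamma_max - gamma_min) * d


def td_target(reward, next_q, gamma, done):
    """One-step critic target of the usual DDPG form, with a per-transition gamma."""
    return reward + (0.0 if done else gamma * next_q)


if __name__ == "__main__":
    for distance in (0.0, 0.5, 1.0):
        gamma = dynamic_gamma(distance)
        print(distance, gamma, effective_horizon(gamma), td_target(1.0, 2.0, gamma, False))
```

In a standard DDPG critic update the discount is a fixed constant; the only change sketched here is that it is recomputed per transition from the attacker's current estimate of the victim's behaviour.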