Evaluating deep reinforcement learning (DRL) agents against targeted behavior attacks is critical for assessing their robustness. These attacks aim to manipulate the victim into specific behaviors that align with the attacker's objectives, often bypassing traditional reward-based defenses. Prior methods have primarily focused on reducing cumulative rewards; however, rewards are typically too generic to capture complex safety requirements. As a result, focusing solely on reward reduction can yield suboptimal attack strategies, particularly in safety-critical scenarios that require more precise behavior manipulation. To address these challenges, we propose RAT, a method designed for universal, targeted behavior attacks. RAT trains an intention policy that is explicitly aligned with human preferences and serves as a precise behavioral target for the adversary. Concurrently, an adversary manipulates the victim's policy to follow this target behavior. To make these attacks more effective, RAT dynamically adjusts the state occupancy measure within the replay buffer, allowing for more controlled behavior manipulation. Our empirical results on robotic simulation tasks demonstrate that RAT outperforms existing adversarial attack algorithms in inducing specific behaviors. Additionally, RAT shows promise in improving agent robustness, leading to more resilient policies. We further validate RAT by guiding Decision Transformer agents to adopt behaviors aligned with human preferences across several MuJoCo tasks.