Adversarial training in reinforcement learning (RL) is challenging because perturbations cascade through trajectories and compound over time, making fixed-strength attacks either overly destructive or too conservative. We propose reward-preserving attacks, which adapt adversarial strength so that an $\alpha$ fraction of the nominal-to-worst-case return gap remains achievable at each state. In deep RL, perturbation magnitudes $\eta$ are selected dynamically using a learned critic $Q((s,a),\eta)$ that estimates the expected return of $\alpha$-reward-preserving rollouts. For intermediate values of $\alpha$, this adaptive training yields policies that are robust across a wide range of perturbation magnitudes while preserving nominal performance, outperforming adversarial training with fixed or uniformly sampled perturbation radii.
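Concretely, one reading of the reward-preserving criterion (a sketch in our notation; the paper's formal definition may differ) is that the attack at each state-action pair uses the largest magnitude $\eta$ whose estimated return still retains an $\alpha$ fraction of the gap above the worst case:
$$\eta^{\star}(s,a) \;=\; \max\Bigl\{\eta \;:\; Q((s,a),\eta) \,\ge\, Q_{\min}(s,a) + \alpha\,\bigl(Q((s,a),0) - Q_{\min}(s,a)\bigr)\Bigr\},$$
where $Q_{\min}(s,a) = \min_{\eta'} Q((s,a),\eta')$ is the estimated worst-case return and $Q((s,a),0)$ the nominal one. Under this reading, $\eta = 0$ always satisfies the bound, so $\eta^{\star}$ is well defined; $\alpha \to 1$ recovers near-nominal training while $\alpha \to 0$ admits the strongest available attack.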