We study the corruption robustness of in-context reinforcement learning (ICRL), focusing on the Decision-Pretrained Transformer (DPT; Lee et al., 2023). To defend the DPT against reward poisoning attacks, we propose a novel adversarial training framework, Adversarially Trained DPT (AT-DPT). Our method simultaneously trains a population of attackers, which poison environment rewards to minimize the DPT's true reward, and a DPT model, which learns to infer optimal actions from the poisoned data. We compare AT-DPT with standard bandit algorithms, including robust baselines designed to handle reward contamination. AT-DPT significantly outperforms these baselines in bandit settings under a learned attacker, and it generalizes to more complex settings, such as adaptive attackers and MDPs. These results show the promise of ICRL as a meta-RL approach to learning effective corruption-robust algorithms.
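To make the threat model concrete, the following toy sketch shows how reward poisoning can mislead a learner in a two-armed bandit. It is an illustration under stated assumptions, not the AT-DPT implementation: the attacker is a fixed per-arm reward shift rather than a learned one, a greedy empirical-mean learner stands in for the DPT, and all names (`collect_context`, `greedy_arm`, `poison`) are hypothetical.

```python
import random


def collect_context(true_means, n_steps, poison, rng):
    """Pull arms round-robin and record (arm, observed_reward) pairs.

    `poison[a]` is a hypothetical fixed shift the attacker adds to the
    reward of arm `a` before the learner sees it. The true reward of
    the environment is unchanged.
    """
    context = []
    for t in range(n_steps):
        arm = t % len(true_means)
        true_reward = rng.gauss(true_means[arm], 0.1)
        context.append((arm, true_reward + poison[arm]))
    return context


def greedy_arm(context, n_arms):
    """Return the arm with the highest empirical mean observed reward."""
    sums = [0.0] * n_arms
    counts = [0] * n_arms
    for arm, reward in context:
        sums[arm] += reward
        counts[arm] += 1
    means = [s / c if c else float("-inf") for s, c in zip(sums, counts)]
    return max(range(n_arms), key=means.__getitem__)


true_means = [0.2, 0.8]  # arm 1 is truly optimal
rng = random.Random(0)

# Clean data: the learner observes unmodified rewards.
clean = collect_context(true_means, 200, [0.0, 0.0], rng)
# Poisoned data: the attacker lowers every observed reward of arm 1.
poisoned = collect_context(true_means, 200, [0.0, -1.0], rng)

clean_choice = greedy_arm(clean, 2)
poisoned_choice = greedy_arm(poisoned, 2)

# Performance is measured in true reward, not poisoned reward.
true_regret = max(true_means) - true_means[poisoned_choice]
print(clean_choice, poisoned_choice, true_regret)
```

The learner is scored on the true reward, so the shift that makes arm 1 look worse translates directly into regret. In the setting described above, the attackers are trained rather than fixed, and the learner is trained to recover the optimal action despite the poisoned context.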