Decision Transformer (DT), as one of the representative Reinforcement Learning via Supervised Learning (RvS) methods, has achieved strong performance in offline learning tasks by leveraging the powerful Transformer architecture for sequential decision-making. However, in adversarial environments, these methods can be non-robust, since the return depends on the strategies of both the decision-maker and the adversary. Training a probabilistic model conditioned on observed returns to predict actions can fail to generalize, as trajectories that achieve a given return in the dataset may have done so only because the behavior adversary was suboptimal. To address this, we propose a worst-case-aware RvS algorithm, the Adversarially Robust Decision Transformer (ARDT), which learns and conditions the policy on in-sample minimax returns-to-go. ARDT aligns the target return with the worst-case return learned through minimax expectile regression, thereby enhancing robustness against powerful test-time adversaries. In experiments on sequential games with full data coverage, ARDT can generate a maximin (Nash Equilibrium) strategy, the solution with the largest adversarial robustness. In large-scale sequential games and continuous adversarial RL environments with partial data coverage, ARDT demonstrates significantly superior robustness to powerful test-time adversaries and attains higher worst-case returns compared to contemporary DT methods.
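The expectile regression that underlies ARDT's worst-case return estimation can be illustrated with a minimal, self-contained sketch (not from the paper; the function name and data are illustrative). An expectile with level τ near 0 approximates the minimum of a set of observed returns, while τ near 1 approximates the maximum, which is how an in-sample minimax over adversary and protagonist actions can be approximated without querying out-of-sample values:

```python
import numpy as np

def fit_expectile(targets, tau, lr=0.5, steps=2000):
    """Fit a scalar tau-expectile of `targets` by gradient descent on the
    asymmetric squared loss: over-estimates (target - pred < 0) are weighted
    by (1 - tau), under-estimates by tau. For tau -> 0 the solution is pulled
    toward the smallest target (a worst-case estimate); for tau -> 1, toward
    the largest."""
    pred = float(np.mean(targets))
    for _ in range(steps):
        diff = targets - pred
        weight = np.where(diff < 0, 1.0 - tau, tau)
        grad = -2.0 * np.mean(weight * diff)  # d/dpred of mean(weight * diff^2)
        pred -= lr * grad
    return pred

# Hypothetical returns observed after the same state in an offline dataset.
returns = np.array([1.0, 3.0, 10.0])
print(fit_expectile(returns, tau=0.01))  # close to min(returns) = 1.0
print(fit_expectile(returns, tau=0.99))  # close to max(returns) = 10.0
```

In this reading, a low-τ expectile over adversary choices stands in for the min, and a high-τ expectile over protagonist choices for the max, so the relabeled target returns-to-go reflect the worst case the data supports rather than the returns a lucky, suboptimal adversary allowed.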