Safe reinforcement learning (RL) trains a constraint-satisfying policy by interacting with the environment. We tackle a more challenging problem: learning a safe policy from an offline dataset. We study the offline safe RL problem from a novel multi-objective optimization perspective and propose the $\epsilon$-reducible concept to characterize problem difficulty. The inherent trade-off between safety and task performance motivates the constrained decision transformer (CDT), which can adjust this trade-off dynamically during deployment. Extensive experiments show the advantages of the proposed method in learning an adaptive, safe, robust, and high-reward policy. Using the same hyperparameters across all tasks, CDT outperforms its variants and strong offline safe RL baselines by a large margin while retaining zero-shot adaptation to different constraint thresholds, which makes our approach more suitable for real-world RL under constraints.
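The abstract does not spell out how the trade-off is adjusted at deployment. As a minimal sketch, assuming CDT follows the decision transformer convention of conditioning on returns-to-go, and assuming a cost signal is tracked alongside the reward, the two conditioning sequences and a threshold check might be computed as below. The names `returns_to_go` and `is_feasible` and the toy trajectory are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def returns_to_go(values: Sequence[float]) -> List[float]:
    """Return suffix sums: entry t is the sum of values[t:]."""
    out: List[float] = []
    running = 0.0
    for v in reversed(values):
        running += v
        out.append(running)
    out.reverse()
    return out


def is_feasible(costs: Sequence[float], threshold: float) -> bool:
    """Check whether a trajectory's total cost stays within the threshold."""
    return sum(costs) <= threshold


# Toy trajectory (illustrative values only, not data from the paper).
rewards = [1.0, 0.5, 2.0]
costs = [0.0, 1.0, 0.0]

reward_to_go = returns_to_go(rewards)  # [3.5, 2.5, 2.0]
cost_to_go = returns_to_go(costs)      # [1.0, 1.0, 0.0]

# Assumption: a return-conditioned policy would take the first entries as
# its initial targets, so a different target cost could be supplied at
# deployment without retraining.
print(reward_to_go, cost_to_go, is_feasible(costs, threshold=1.0))
```

Under this reading, tightening or relaxing the constraint at test time amounts to changing the target cost fed to the policy, which is one way the zero-shot adaptation described above could be realized.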