Safe reinforcement learning (RL) trains a constraint-satisfying policy by interacting with the environment. We tackle a more challenging problem: learning a safe policy from an offline dataset. We study the offline safe RL problem from a novel multi-objective optimization perspective and propose the $\epsilon$-reducible concept to characterize problem difficulty. The inherent trade-off between safety and task performance motivates the constrained decision transformer (CDT), which can dynamically adjust this trade-off during deployment. Extensive experiments show the advantages of the proposed method in learning an adaptive, safe, robust, and high-reward policy. Using the same hyperparameters across all tasks, CDT outperforms its variants and strong offline safe RL baselines by a large margin while retaining zero-shot adaptation to different constraint thresholds, which makes our approach more suitable for real-world RL under constraints. The code is available at https://github.com/liuzuxin/OSRL.