In safe offline reinforcement learning (RL), the objective is to learn, from offline data alone, a policy that maximizes cumulative reward while strictly satisfying safety constraints. Traditional methods often struggle to balance these two goals, yielding either degraded performance or elevated safety risk. We address this with a novel approach that first learns a conservatively safe policy using a Conditional Variational Autoencoder (CVAE) to model the latent safety constraints. We then formulate the task as a Constrained Reward-Return Maximization problem, in which the policy optimizes reward while complying with the inferred latent safety constraints. This is achieved by training an encoder with a reward-Advantage Weighted Regression (AWR) objective within the latent constraint space. Our method is supported by theoretical analysis, including bounds on policy performance and sample complexity. Extensive empirical evaluation on benchmark datasets, including challenging autonomous driving scenarios, shows that our approach not only maintains safety compliance but also excels at cumulative reward optimization, surpassing existing methods. Additional visualizations offer further insight into the effectiveness and underlying mechanisms of our approach.
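To make the reward-Advantage Weighted Regression objective concrete, the following is a minimal, illustrative Python sketch. It is not the paper's implementation: the function names, the temperature `beta`, and the weight clip `w_max` are our own assumptions; the standard AWR recipe exponentiates advantages to weight a regression loss, here a squared error in an assumed latent constraint space.

```python
import numpy as np

def awr_weights(advantages, beta=1.0, w_max=20.0):
    # Advantage-Weighted Regression weights: exponentiate the (reward)
    # advantages with temperature beta, then clip for numerical stability.
    # beta and w_max are illustrative hyperparameters, not from the paper.
    w = np.exp(advantages / beta)
    return np.minimum(w, w_max)

def weighted_regression_loss(pred_latents, target_latents, advantages, beta=1.0):
    # Weighted squared error in a (hypothetical) latent constraint space:
    # samples with higher reward-advantage pull the encoder more strongly,
    # so the encoder is steered toward high-reward regions of the latent
    # space while remaining within the constraint-shaped support.
    w = awr_weights(advantages, beta)
    per_sample = np.sum((pred_latents - target_latents) ** 2, axis=-1)
    return float(np.mean(w * per_sample))
```

A zero-advantage sample receives weight 1 and contributes its plain squared error; a high-advantage sample contributes up to `w_max` times as much, which is the mechanism by which the regression favors reward-improving latents.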