Safe reinforcement learning (RL) offers advanced solutions to constrained optimal control problems. Existing studies in safe RL implicitly assume that the policy function is continuous, mapping states to actions in a smooth, uninterrupted manner. However, our research finds that in some scenarios the feasible policy should be discontinuous or multi-valued, because interpolating between discontinuous local optima inevitably leads to constraint violations. We are the first to identify the generating mechanism of this phenomenon, and we employ topological analysis to rigorously prove the existence of policy bifurcation in safe RL, which corresponds to the contractibility of the reachable tuple. Our theorem reveals that in scenarios where the obstacle-free state space is non-simply connected, a feasible policy must be bifurcated, meaning that its output action needs to change abruptly in response to the varying state. To train such a bifurcated policy, we propose a safe RL algorithm called multimodal policy optimization (MUPO), which uses a Gaussian mixture distribution as the policy output. The bifurcated behavior is achieved by selecting the Gaussian component with the highest mixing coefficient. MUPO also integrates spectral normalization and forward KL divergence to enhance the policy's capability of exploring different modes. Experiments on vehicle control tasks show that our algorithm successfully learns the bifurcated policy and achieves satisfactory safety, while a continuous policy suffers from inevitable constraint violations.
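The component-selection step described in the abstract can be illustrated with a minimal sketch. This is not the MUPO implementation: the two-component mixture, the function names (`toy_mixture`, `bifurcated_action`, `averaged_action`), and all numeric values are hypothetical, chosen only to show why picking the component with the highest mixing coefficient yields an abrupt action change while averaging the components does not. The scalar state is assumed to be a vehicle's lateral offset from the center of an obstacle ahead, and the action a steering command.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_mixture(lateral_offset):
    """Hypothetical stand-in for a learned Gaussian mixture policy head.

    Returns mixing coefficients, component means, and component standard
    deviations for a 1-D steering action. All values are illustrative
    assumptions, not taken from the paper.
    """
    means = [-0.5, 0.5]  # component 0: steer left, component 1: steer right
    stds = [0.1, 0.1]
    # The mixing logits shift smoothly with the state.
    logits = [-4.0 * lateral_offset, 4.0 * lateral_offset]
    return softmax(logits), means, stds


def bifurcated_action(weights, means):
    """Deterministic action: mean of the component with the largest weight."""
    best = max(range(len(weights)), key=lambda k: weights[k])
    return means[best]


def averaged_action(weights, means):
    """Continuous alternative: weighted average of the component means."""
    return sum(w * m for w, m in zip(weights, means))


# Two states on either side of the obstacle center line.
for offset in (-0.05, 0.05):
    weights, means, _ = toy_mixture(offset)
    print(offset, bifurcated_action(weights, means),
          round(averaged_action(weights, means), 3))
```

In this toy setting the selected action flips between the two component means as the offset crosses zero, whereas the averaged action stays close to zero steering near the center line, which in the assumed scenario would mean driving toward the obstacle.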