Constrained Markov decision processes (CMDPs) provide a principled model for handling constraints, such as safety and other auxiliary objectives, in reinforcement learning. The common approach of additive-cost constraints with dual variables often hinders off-policy scalability. We propose a Control as Inference formulation based on stochastic decision horizons, in which constraint violations attenuate reward contributions and shorten the effective planning horizon via state-action-dependent continuation. This yields survival-weighted objectives that remain replay-compatible for off-policy actor-critic learning. We introduce two violation semantics, absorbing and virtual termination, which share the same survival-weighted return but induce distinct optimization structures, each admitting SAC/MPO-style policy improvement. Experiments demonstrate improved sample efficiency and favorable return-violation trade-offs on standard benchmarks. Moreover, MPO with virtual termination (VT-MPO) scales effectively to our high-dimensional musculoskeletal Hyfydy setup.
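The survival-weighted return described above can be sketched as follows. This is a hypothetical illustration under assumed conventions, not the paper's implementation: each step carries a continuation probability in [0, 1] (below 1 when a constraint is violated), and the cumulative product of past continuations attenuates later rewards, shortening the effective horizon.

```python
def survival_weighted_return(rewards, continuation_probs, gamma=0.99):
    """Discounted return weighted by cumulative survival.

    Hypothetical sketch: reward at step t is scaled by the product of
    continuation probabilities of all earlier steps, so violations
    (continuation < 1) attenuate future reward contributions.
    """
    survival = 1.0  # probability the effective horizon has not yet ended
    g = 0.0
    for t, (r, beta) in enumerate(zip(rewards, continuation_probs)):
        g += (gamma ** t) * survival * r
        survival *= beta  # a violation at step t shrinks survival afterward
    return g
```

With no violations (all continuation probabilities 1) this reduces to the ordinary discounted return; persistent violations drive the survival weight toward zero, mimicking early termination without actually ending the replayed trajectory.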