Constrained MDPs (CMDPs) are a widely adopted framework for incorporating safety into RL agents; however, the framework does not support risk-sensitive constraints. This can be problematic: for example, CMDPs admit optimal solutions that satisfy the risk-neutral constraints by mixing infrequent catastrophic behaviors with frequent, overly conservative ones. Moreover, prior empirical results suggest that enforcing stricter, risk-sensitive constraints can improve performance even under risk-neutral evaluation. The natural framework for incorporating risk-sensitive constraints is the utility-constrained MDP (UCMDP), but no practical solution method for it has existed so far. In this work, we introduce a simple yet powerful methodology for UCMDPs and constrained RL. Besides allowing for risk-sensitive constraints, our framework does not require constraint limits to be fixed before training the agent, provided that a sensible range is known. This increases policy flexibility and, in practice, allows these limits to be adjusted at no extra training cost. Beyond the generality of the framework, our agent shows strong empirical performance, consistently matching or outperforming existing baselines on several Safety Gymnasium benchmark tasks.
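To make the mixing issue concrete, the following minimal sketch compares a risk-neutral constraint (expected cost) with a risk-sensitive one on two hypothetical episode-cost distributions. The choice of CVaR as the risk measure, the cost values, the probabilities, the limit of 5, and the helper names `expected_cost` and `cvar` are all illustrative assumptions; they are not taken from the paper and do not represent its method.

```python
def expected_cost(costs, probs):
    """Risk-neutral summary: mean episode cost of a discrete distribution."""
    return sum(c * p for c, p in zip(costs, probs))


def cvar(costs, probs, alpha):
    """Risk-sensitive summary: mean cost over the worst alpha-fraction of outcomes.

    Assumes a discrete distribution with probabilities summing to 1
    and 0 < alpha <= 1.
    """
    remaining = alpha  # probability mass of the tail still to be filled
    total = 0.0
    # Walk through outcomes from highest cost to lowest.
    for cost, prob in sorted(zip(costs, probs), reverse=True):
        take = min(prob, remaining)
        total += cost * take
        remaining -= take
        if remaining <= 0:
            break
    return total / alpha


# Hypothetical mixture: rare catastrophe (cost 100), otherwise overly cautious (cost 0).
mix_costs, mix_probs = [100.0, 0.0], [0.05, 0.95]
# Hypothetical steady behavior: always a moderate cost of 5.
steady_costs, steady_probs = [5.0], [1.0]

limit = 5.0   # illustrative constraint limit
alpha = 0.1   # illustrative tail level for CVaR
tol = 1e-9    # tolerance for floating-point comparisons

for name, costs, probs in [("mixture", mix_costs, mix_probs),
                           ("steady", steady_costs, steady_probs)]:
    mean_ok = expected_cost(costs, probs) <= limit + tol
    tail_ok = cvar(costs, probs, alpha) <= limit + tol
    print(f"{name}: satisfies mean constraint={mean_ok}, satisfies CVaR constraint={tail_ok}")
```

Both distributions have a mean cost of 5, so an expected-cost constraint cannot tell them apart. The worst 10% of the mixture's outcomes average a cost of 50, so a tail-based constraint with the same limit rejects the mixture and accepts the steady behavior.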