Ensuring the safe exploration of reinforcement learning (RL) agents is critical for deployment in real-world systems. Yet existing approaches struggle to strike the right balance: methods that tightly enforce safety often cripple task performance, while those that prioritize reward frequently violate safety constraints, producing diffuse cost landscapes that flatten gradients and stall policy improvement. We introduce the Uncertain Safety Critic (USC), a novel approach that integrates uncertainty-aware modulation and refinement into critic training. By concentrating conservatism in uncertain and costly regions while preserving sharp gradients in safe areas, USC enables policies to achieve effective reward-safety trade-offs. Extensive experiments show that USC reduces safety violations by approximately 40% while maintaining competitive or higher rewards, and reduces the error between predicted and true cost gradients by approximately 83%, breaking the prevailing trade-off between safety and performance and paving the way for scalable safe RL.
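To make the idea of "concentrating conservatism in uncertain and costly regions" concrete, the following is a minimal sketch of one plausible instantiation, assuming the uncertainty estimate comes from ensemble disagreement among cost critics. The class name `UncertainSafetyCritic`, the helper `pessimistic_cost`, the ensemble size, and the modulation weight `beta` are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: an ensemble cost critic whose conservatism is
# modulated by epistemic uncertainty, in the spirit of USC (assumption,
# not the paper's published code).
import torch
import torch.nn as nn


class UncertainSafetyCritic(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, n_heads: int = 5, hidden: int = 64):
        super().__init__()
        # Ensemble of cost critics; disagreement across heads serves as
        # an epistemic-uncertainty estimate for the predicted cost.
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(n_heads)
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor):
        x = torch.cat([obs, act], dim=-1)
        preds = torch.stack([h(x) for h in self.heads], dim=0)  # (n_heads, B, 1)
        return preds.mean(dim=0), preds.std(dim=0)


def pessimistic_cost(critic: UncertainSafetyCritic,
                     obs: torch.Tensor,
                     act: torch.Tensor,
                     beta: float = 1.0) -> torch.Tensor:
    # Modulated (pessimistic) cost estimate: the conservatism term scales
    # with both the uncertainty (std) and the predicted cost itself, so it
    # concentrates in uncertain, costly regions while leaving confidently
    # safe regions, and hence their gradients, nearly untouched.
    mean, std = critic(obs, act)
    weight = torch.sigmoid(mean.detach())  # larger in high-cost regions
    return mean + beta * weight * std
```

Under this reading, the policy would be penalized with `pessimistic_cost` rather than the raw critic mean, keeping cost gradients sharp where the critic is confident the state is safe.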