Safe reinforcement learning (RL) seeks to mitigate the unsafe behaviors that arise from exploration during training by reducing constraint violations while maintaining task performance. Existing approaches typically either rely on a single policy that jointly optimizes reward and safety, which can become unstable under the conflicting objectives, or employ external safety filters that override the policy's actions and require prior knowledge of the system. In this paper, we propose a modular cost-aware regulator that scales the agent's actions based on predicted constraint violations, preserving exploration through smooth action modulation rather than overriding the policy. The regulator is trained to minimize constraint violations while avoiding degenerate suppression of actions. Our approach integrates seamlessly with off-policy RL methods such as SAC and TD3, and achieves state-of-the-art return-to-cost ratios on Safety Gym locomotion tasks with sparse costs, reducing constraint violations by a factor of up to 126 while increasing returns by over an order of magnitude compared to prior methods.
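To make the modulation idea concrete, the following is a minimal PyTorch sketch, written under our own assumptions rather than taken from the paper's implementation: the class name CostAwareRegulator, the network shape, and the sigmoid squashing are all illustrative. A small network scores the state-action pair for expected constraint violation and maps the score to a smooth scale in (0, 1) that attenuates the action instead of replacing it, which is how exploration is preserved.

```python
import torch
import torch.nn as nn

class CostAwareRegulator(nn.Module):
    """Illustrative cost-aware action regulator (names and shapes are assumptions).

    A small network scores the (state, action) pair for expected constraint
    violation; a sigmoid maps the score to a smooth scale in (0, 1) that
    attenuates the action rather than overriding it.
    """

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        violation_score = self.net(torch.cat([obs, action], dim=-1))
        # High predicted violation -> scale near 0 (strong attenuation);
        # low predicted violation -> scale near 1 (action passes through).
        scale = torch.sigmoid(-violation_score)
        return scale * action

# Usage sketch: wrap the policy's action before it reaches the environment.
regulator = CostAwareRegulator(obs_dim=8, act_dim=2)
obs = torch.randn(4, 8)                   # batch of observations
raw_action = torch.randn(4, 2)            # actions from e.g. a SAC/TD3 policy
safe_action = regulator(obs, raw_action)  # same shape, smoothly scaled

# Training (not shown) would minimize a predicted-cost term plus a
# regularizer that keeps the scale near 1, discouraging the degenerate
# solution of suppressing every action.
```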