Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach safe exploration through the lens of epistemic uncertainty, where the actor's sensitivity to parameter perturbations serves as a practical proxy for regions of high uncertainty. We propose Sharpness-Aware Policy Optimization (SHAPO), a sharpness-aware policy update rule that evaluates gradients at perturbed parameters, making policy updates pessimistic with respect to the actor's epistemic uncertainty. Analytically, we show that this adjustment implicitly reweights policy gradients, amplifying the influence of rare unsafe actions while tempering contributions from already-safe ones, thereby biasing learning toward conservative behavior in under-explored regions. Across several continuous-control tasks, our method consistently improves both safety and task performance over existing baselines, significantly expanding their Pareto frontiers.
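To make the phrase "evaluates gradients at perturbed parameters" concrete, the following is a minimal sketch of a generic sharpness-aware (SAM-style) update step. It is not the SHAPO update itself: the abstract does not specify how the perturbation is chosen, so the normalized-gradient perturbation of radius `rho`, the function name `sharpness_aware_step`, and the toy quadratic loss are all assumptions made for illustration.

```python
import numpy as np


def sharpness_aware_step(theta, grad_fn, lr=0.1, rho=0.05, eps=1e-12):
    """One SAM-style descent step (illustrative assumption, not the paper's rule).

    theta   : current parameter vector
    grad_fn : callable returning the loss gradient at a parameter vector
    lr      : step size
    rho     : radius of the parameter perturbation (hypothetical default)
    """
    g = grad_fn(theta)
    # First-order worst-case perturbation inside an L2 ball of radius rho.
    perturbation = rho * g / (np.linalg.norm(g) + eps)
    # The gradient is evaluated at the perturbed parameters ...
    g_perturbed = grad_fn(theta + perturbation)
    # ... but the step is applied to the original parameters.
    return theta - lr * g_perturbed


# Toy loss L(theta) = 0.5 * theta^T A theta, sharp along the second axis.
A = np.diag([1.0, 10.0])


def toy_grad(theta):
    return A @ theta


theta0 = np.array([1.0, 1.0])
plain = theta0 - 0.1 * toy_grad(theta0)            # ordinary gradient step
sharp = sharpness_aware_step(theta0, toy_grad)     # sharpness-aware step
print(plain, sharp)
```

In this toy setting the sharpness-aware step moves further along the high-curvature axis than the plain gradient step, which illustrates how evaluating the gradient at perturbed parameters changes the weight of directions to which the loss is most sensitive. Setting `rho=0` recovers the ordinary gradient step.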