Offline safe reinforcement learning often requires policies to adapt at deployment time to safety budgets that vary across episodes or change within a single episode. While diffusion-based planners enable flexible trajectory generation, existing guidance schemes often treat reward improvement and constraint satisfaction as competing gradient objectives, which can lead to unreliable safety compliance under cost limits. We reinterpret adaptive safe trajectory generation as sampling from a constrained trajectory distribution, in which the budget restricts the feasible trajectory region and the reward shapes preferences within it. This perspective motivates Safe Decoupled Guidance Diffusion (SDGD), which conditions classifier-free guidance on the cost limit to bias sampling toward trajectories that satisfy the specified limit, while using reward-gradient guidance to refine trajectories for higher return. Because direct reward guidance can increase return while also steering samples toward trajectories with higher cumulative cost, we introduce Feasible Trajectory Relabeling (FTR), which reshapes reward targets to discourage such directions. We further provide a first-order sampling-time analysis showing that FTR suppresses reward-induced cost drift under a prefix-restorative alignment condition. Extensive evaluations on the DSRL benchmark show that SDGD achieves the strongest safety compliance among the compared methods, satisfying the cost constraint on 36 of 38 tasks (94.7%), while obtaining the highest reward among safe methods on 21 tasks.
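As a rough illustration of the decoupled guidance described above, the sketch below shows one way a single denoising step could combine budget-conditioned classifier-free guidance with reward-gradient guidance. It is a minimal sketch under stated assumptions, not the paper's implementation: the model interfaces (eps_model, reward_model), tensor shapes, and guidance weights (w_cfg, w_reward) are hypothetical.

```python
# Minimal sketch (not the authors' implementation): one way a reverse-diffusion
# step could combine budget-conditioned classifier-free guidance with
# reward-gradient guidance. All interfaces and weights are illustrative assumptions.
import torch


def guided_x0_estimate(eps_model, reward_model, x_t, t, budget,
                       alpha_bar_t, w_cfg=2.0, w_reward=0.1):
    """Return a guided estimate of the clean trajectory x_0 at diffusion step t.

    Assumed interfaces:
      eps_model(x_t, t, budget) -> predicted noise; budget=None gives the
        unconditional prediction used for classifier-free guidance.
      reward_model(x_0) -> per-trajectory return estimate (in the method above,
        this would be trained on FTR-reshaped reward targets).
    """
    # Classifier-free guidance: contrast the cost-limit-conditioned and
    # unconditional noise predictions to bias sampling toward trajectories
    # that respect the specified budget.
    eps_cond = eps_model(x_t, t, budget)
    eps_uncond = eps_model(x_t, t, None)
    eps = eps_uncond + w_cfg * (eps_cond - eps_uncond)

    # Recover the clean-trajectory estimate implied by the noise prediction.
    x0_hat = (x_t - (1.0 - alpha_bar_t) ** 0.5 * eps) / (alpha_bar_t ** 0.5)

    # Reward-gradient guidance: nudge the estimate toward higher predicted
    # return; with FTR-reshaped targets this gradient is discouraged from
    # pointing toward higher-cost trajectories.
    x0_hat = x0_hat.detach().requires_grad_(True)
    grad = torch.autograd.grad(reward_model(x0_hat).sum(), x0_hat)[0]
    return (x0_hat + w_reward * grad).detach()
```

In a full sampler, this guided estimate would feed into the standard DDPM posterior to draw x_{t-1}, with the cost limit supplied as a conditioning input that can be changed per episode at deployment time.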