Off-dynamics offline reinforcement learning seeks to learn a target-domain policy from a large source dataset and a limited target dataset under mismatched transition dynamics. Existing approaches such as reward augmentation and data filtering are constrained to the source dataset and cannot synthesize new target behavior to improve coverage beyond the collected source trajectories. While recent model-based methods attempt to address this by learning target-aware dynamics, the generated experience is constructed only at the transition level, which leads to accumulated errors over long horizons. These limitations necessitate a shift toward trajectory-level generation for off-dynamics offline RL. We propose CEDGE, a Cross-domain Energy-guided Diffusion GEneration framework. CEDGE trains a trajectory diffusion model on source-domain trajectories and adapts the generated samples to the target domain through energy guidance. This guidance is derived by minimizing the distribution mismatch between the source and desired target-domain trajectories and is decomposed into return, domain, and behavior energy components. The resulting energy-guided trajectories are useful both for direct planning and as synthetic data for policy learning. Since target adaptation is achieved via energy guidance rather than retraining the diffusion model, CEDGE can be efficiently adapted to new target dynamics compared to previous methods. Experiments on the ODRL benchmark demonstrate that trajectory-level energy-guided generation improves diffusion planning under dynamics shifts and produces synthetic data that improves downstream target policy learning.
翻译:跨域离线强化学习旨在从大规模源数据集和有限目标数据集,在转移动力学不匹配的情况下学习目标域策略。现有方法(如奖励增强和数据过滤)受限于源数据集,无法合成新的目标行为以改善超出收集源轨迹的覆盖范围。虽然近期基于模型的方法试图通过学习目标感知动力学来解决此问题,但生成的经验仅在转移层面构建,导致长时域下的累积误差。这些局限性要求转向轨迹层面生成以解决动力学偏移下的离线强化学习。我们提出CEDGE——跨域能量引导扩散生成框架。CEDGE在源域轨迹上训练轨迹扩散模型,并通过能量引导将生成样本适应到目标域。该引导通过最小化源域与期望目标域轨迹之间的分布不匹配推导得出,并分解为回报、域和行为三个能量分量。由此产生的能量引导轨迹可用于直接规划,也可作为策略学习的合成数据。由于目标适应通过能量引导而非重新训练扩散模型实现,与以往方法相比,CEDGE可高效适应新的目标动力学。在ODRL基准上的实验表明,轨迹层面的能量引导生成改善了动力学偏移下的扩散规划,并产生了能提升下游目标策略学习的合成数据。