We study off-dynamics offline reinforcement learning, where the goal is to learn a policy from an offline source dataset and a limited target dataset whose dynamics are mismatched. Existing methods either penalize the reward or discard source transitions that occur in parts of the transition space with high dynamics shift. As a result, they optimize the policy using data from low-shift regions only, which limits exploration of high-reward target-domain states that fall outside these regions. Such methods therefore often fail when the dynamics shift is significant or when the optimal trajectories lie outside the low-shift regions. To overcome this limitation, we propose MOBODY, a Model-Based Off-Dynamics Offline RL algorithm that optimizes a policy using transitions generated from learned target dynamics to explore the target domain, rather than training only on low-dynamics-shift transitions. For dynamics learning, building on the observation that reaching the same next state requires different actions in different domains, MOBODY employs a separate action encoder for each domain, mapping each domain's actions into a shared latent space, while sharing a unified state representation and a common transition function across domains. We further introduce a target Q-weighted behavior cloning loss in policy optimization to avoid out-of-distribution actions. This loss pushes the policy toward actions with high target-domain Q-values, rather than toward actions with high source-domain Q-values or toward uniform imitation of all actions in the offline dataset. We evaluate MOBODY on a wide range of MuJoCo and Adroit benchmarks and show that it outperforms state-of-the-art off-dynamics RL baselines, as well as policy learning methods built on alternative dynamics learning baselines, with especially pronounced improvements in challenging scenarios where existing methods struggle.
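To make the two components described above more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes single-layer encoders with `tanh` activations, a linear transition head, and exponentiated, normalized Q-values as behavior cloning weights; the abstract specifies none of these details. The names `LatentDynamics`, `q_weighted_bc_loss`, and `temperature` are hypothetical and introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear(in_dim, out_dim):
    """Return a randomly initialised (weight, bias) pair for one linear layer."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


class LatentDynamics:
    """Shared state encoder and transition, with one action encoder per domain.

    Assumption: single-layer encoders and a linear transition head.
    """

    def __init__(self, state_dim, action_dim, latent_dim):
        self.state_enc = linear(state_dim, latent_dim)       # shared across domains
        self.action_enc = {                                  # domain-specific
            "source": linear(action_dim, latent_dim),
            "target": linear(action_dim, latent_dim),
        }
        self.transition = linear(2 * latent_dim, state_dim)  # shared across domains

    def predict(self, state, action, domain):
        """Predict the next state for a batch of (state, action) pairs."""
        w, b = self.state_enc
        z_state = np.tanh(state @ w + b)
        w, b = self.action_enc[domain]
        z_action = np.tanh(action @ w + b)
        w, b = self.transition
        return np.concatenate([z_state, z_action], axis=-1) @ w + b


def q_weighted_bc_loss(policy_actions, data_actions, target_q, temperature=1.0):
    """Behavior cloning loss weighted by target-domain Q-values.

    Assumption: weights are a softmax over the batch of target Q-values.
    """
    # Subtract the max before exponentiating for numerical stability.
    weights = np.exp((target_q - target_q.max()) / temperature)
    weights = weights / weights.sum()
    per_sample = ((policy_actions - data_actions) ** 2).sum(axis=-1)
    return float((weights * per_sample).sum())
```

In this sketch, calling `predict` with `domain="source"` or `domain="target"` changes only the action encoder, so the state representation and the transition parameters are fitted on data from both domains. The loss reduces to plain behavior cloning when all target Q-values are equal, and concentrates on high-Q dataset actions as `temperature` decreases.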