Training a policy in a source domain for deployment in a target domain under a dynamics shift can be challenging and often results in performance degradation. Prior work addresses this challenge by training in the source domain with modified rewards, derived by matching the distribution of source trajectories to that of trajectories from the target-optimal policy. However, modified rewards alone only ensure that the learned policy's behavior in the source domain resembles trajectories produced by the target-optimal policy; they do not guarantee optimal performance once the learned policy is actually deployed in the target domain. In this work, we propose to use imitation learning to transfer the policy learned under reward modification to the target domain, so that the new policy generates the same trajectories in the target domain. Our approach, Domain Adaptation and Reward Augmented Imitation Learning (DARAIL), uses reward modification for domain adaptation and follows the general framework of generative adversarial imitation learning from observation (GAIfO), applying a reward-augmented estimator in the policy optimization step. Theoretically, we present an error bound for our method under a mild assumption on the dynamics shift, which justifies its motivation. Empirically, our method outperforms the pure modified-reward method without imitation learning, as well as other baselines, on benchmark off-dynamics environments.
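To make the two-stage structure described above concrete, the sketch below illustrates one plausible reading of the pipeline; it is not the authors' implementation. Everything here is an assumption for illustration: the "demonstrations" are (s, s') pairs from a source policy already trained with the modified reward, the imitation stage uses a GAIfO-style discriminator over state transitions, and the reward-augmented estimator is taken to be the discriminator reward -log(1 - D) mixed with the environment reward via a hypothetical coefficient beta. Rollout collection and the policy-update step are left as placeholders.

```python
"""Minimal sketch of a GAIfO-style imitation stage with a reward-augmented
estimator, assuming the structure described in the abstract (hypothetical)."""
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class TransitionDiscriminator:
    """Logistic-regression discriminator on (s, s') pairs (stand-in for a
    neural network). D(s, s') estimates the probability that a transition
    comes from the source-policy demonstrations rather than the learner."""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(2 * dim)
        self.b = 0.0
        self.lr = lr

    def predict(self, s, s_next):
        x = np.concatenate([s, s_next], axis=-1)
        return sigmoid(x @ self.w + self.b)

    def update(self, demo, learner):
        # One gradient step of binary cross-entropy: demos labeled 1, learner 0.
        xs = np.concatenate([demo, learner], axis=0)
        ys = np.concatenate([np.ones(len(demo)), np.zeros(len(learner))])
        p = sigmoid(xs @ self.w + self.b)
        grad = xs.T @ (p - ys) / len(ys)
        self.w -= self.lr * grad
        self.b -= self.lr * np.mean(p - ys)


def augmented_reward(disc, s, s_next, env_reward, beta=0.1):
    """Hypothetical reward-augmented estimator: GAIfO-style imitation reward
    -log(1 - D) plus a beta-weighted environment reward."""
    d = disc.predict(s, s_next)
    return -np.log(1.0 - d + 1e-8) + beta * env_reward


# Toy usage on random placeholder data (stand-ins for real rollouts).
dim = 4
disc = TransitionDiscriminator(dim)
# "Demonstrations": (s, s') pairs generated in the source domain by the policy
# trained with the modified (dynamics-corrected) reward; random placeholders here.
demo = rng.normal(size=(64, 2 * dim))
# Learner transitions collected in the target domain; random placeholders here.
learner = rng.normal(loc=0.5, size=(64, 2 * dim))
for _ in range(100):
    disc.update(demo, learner)

s, s_next, r_env = learner[0, :dim], learner[0, dim:], 1.0
print("augmented reward:", augmented_reward(disc, s, s_next, r_env))
```

In a full training loop, the augmented reward computed this way would replace the environment reward in an off-the-shelf policy-gradient or actor-critic update for the target-domain policy, with the discriminator and policy updated alternately.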