Off-dynamics Reinforcement Learning (ODRL) seeks to transfer a policy from a source environment to a target environment whose dynamics are distinct yet similar. In this setting, traditional RL agents rely excessively on the source dynamics and learn policies that excel in the source environment but perform poorly in the target one. In the few-shot framework, a limited number of transitions from the target environment are made available to facilitate a more effective transfer. To address this challenge, we propose an approach inspired by recent advances in Imitation Learning and conservative RL algorithms. The proposed method introduces a penalty to regulate the trajectories generated by the source-trained policy. We evaluate our method on a range of environments representing diverse off-dynamics conditions, where access to the target environment is extremely limited. These experiments include high-dimensional systems relevant to real-world applications. In most of the tested scenarios, the proposed method improves performance over existing baselines.