Off-dynamics Reinforcement Learning (ODRL) seeks to transfer a policy from a source environment to a target environment with distinct yet similar dynamics. In this setting, traditional RL agents overfit to the dynamics of the source environment, discovering policies that excel there but fail to achieve reasonable performance in the target environment. In the few-shot framework, a small number of transitions from the target environment is made available to enable a more effective transfer. To address this challenge, we propose a novel approach inspired by recent advances in Imitation Learning and conservative RL algorithms. The proposed method introduces a penalty to regulate the trajectories generated by the source-trained policy. We evaluate our method across environments representing diverse off-dynamics conditions in which access to the target environment is extremely limited, including high-dimensional systems relevant to real-world applications. Across most tested scenarios, our method demonstrates performance improvements over existing baselines.
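The abstract does not specify the functional form of the penalty. As a hedged illustration only, the sketch below shows one common instantiation of trajectory regularization from the off-dynamics literature: subtracting from the source reward a penalty proportional to an estimated log-ratio of target versus source transition probabilities (a DARC-style correction). All function names, the log-ratio inputs, and the penalty form are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def dynamics_penalty(log_p_target, log_p_source):
    """Estimated log-ratio of target vs. source transition probabilities.

    In practice this ratio is approximated, e.g. with domain classifiers
    trained on the few available target transitions; here the
    log-probabilities are taken as given for illustration.
    """
    return log_p_target - log_p_source

def penalized_reward(r, log_p_target, log_p_source, beta=1.0):
    """Source reward minus a penalty (hypothetical form) that discourages
    transitions whose dynamics differ between the two environments."""
    return r - beta * np.abs(dynamics_penalty(log_p_target, log_p_source))

# Toy usage: a transition far more likely under source dynamics than
# target dynamics is penalized, steering the source-trained policy away
# from regions where the two environments disagree.
r = 1.0
print(penalized_reward(r, log_p_target=-4.0, log_p_source=-1.0, beta=0.5))
# 1.0 - 0.5 * |(-4.0) - (-1.0)| = -0.5
```

The coefficient `beta` (an assumed hyperparameter) trades off exploiting the source reward against staying conservative in regions where the few target transitions suggest the dynamics diverge.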