Reinforcement learning (RL) can learn complex policies effectively, but doing so often demands extensive trial-and-error interaction with the environment. In many real-world scenarios this is impractical because of the high cost of data collection and safety concerns. A common strategy is therefore to train a policy in a fast, low-cost source simulator and transfer it to the real-world target environment. This transfer is challenging: no simulator, however advanced, perfectly replicates the real world, so the dynamics of the source and target environments differ. Prior work assumes that the source domain covers every possible target transition, a condition we term full support. This assumption is often unrealistic, especially when the dynamics discrepancy is large. In this paper, we focus on adaptation under large dynamics mismatch. We drop the stringent full support condition of earlier work while still aiming to learn an effective policy for the target domain. Our approach is simple but effective: it skews and extends the source support towards the target support to mitigate support deficiencies. In comprehensive experiments on a varied set of benchmarks, our method shows notable improvements over previous techniques.