Generalizing policies across domains with mismatched dynamics is a significant challenge in reinforcement learning. For example, a robot may learn a policy in a simulator, but the environment dynamics can differ once it is deployed in the real world. Given a source domain and a target domain with a dynamics mismatch, we consider the online dynamics adaptation problem, in which the agent has access to sufficient source-domain data while online interactions with the target domain are limited. Existing research has approached this problem from the perspective of dynamics discrepancy. In this work, we reveal the limitations of these methods and explore the problem from the perspective of value difference, building on a novel insight into value consistency across domains. Specifically, we present the Value-Guided Data Filtering (VGDF) algorithm, which selectively shares transitions from the source domain based on the proximity of paired value targets across the two domains. Empirical results on various environments with kinematic and morphology shifts demonstrate that our method achieves superior performance compared to prior approaches.
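The filtering idea described in the abstract, sharing only those source-domain transitions whose value target lies close to a paired target-domain value target, can be illustrated with a minimal sketch. The abstract does not specify how the paired targets are obtained or how proximity is thresholded, so everything below is an assumption made for illustration: the one-step target `r + gamma * V(s')`, the absolute-difference proximity measure, the `keep_fraction` rule, and the function names `td_target` and `select_source_indices` are all hypothetical and are not taken from the VGDF implementation.

```python
import math
from typing import List, Sequence


def td_target(reward: float, next_value: float, gamma: float = 0.99) -> float:
    """One-step value target r + gamma * V(s') (an assumed form of 'value target')."""
    return reward + gamma * next_value


def select_source_indices(
    src_targets: Sequence[float],
    tgt_targets: Sequence[float],
    keep_fraction: float = 0.25,
) -> List[int]:
    """Return indices of source transitions whose value target is closest
    to the paired target-domain value target.

    Proximity is measured here as an absolute difference, and the closest
    `keep_fraction` of transitions is kept; both choices are illustrative.
    """
    if len(src_targets) != len(tgt_targets):
        raise ValueError("value targets must be paired one-to-one")
    if not src_targets:
        return []
    gaps = [abs(s - t) for s, t in zip(src_targets, tgt_targets)]
    n_keep = max(1, math.ceil(keep_fraction * len(gaps)))
    # Rank transitions by gap (smallest first) and keep the closest ones.
    ranked = sorted(range(len(gaps)), key=lambda i: gaps[i])
    return sorted(ranked[:n_keep])


# Toy batch: same rewards, but different estimated next-state values per domain.
rewards = [1.0, 0.5, 0.0, 2.0]
src_next_values = [10.0, 8.0, 5.0, 3.0]   # values of source-domain next states
tgt_next_values = [10.1, 4.0, 5.0, 9.0]   # values of estimated target-domain next states

src = [td_target(r, v, gamma=0.9) for r, v in zip(rewards, src_next_values)]
tgt = [td_target(r, v, gamma=0.9) for r, v in zip(rewards, tgt_next_values)]

kept = select_source_indices(src, tgt, keep_fraction=0.5)
print(kept)  # [0, 2]
```

In this toy batch, transitions 0 and 2 have nearly identical value targets in both domains and are kept, while transitions 1 and 3 are filtered out even though their rewards match, since their next-state values diverge. This reflects the abstract's emphasis on value difference rather than dynamics discrepancy as the sharing criterion.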