Simulation-to-Reality Reinforcement Learning (Sim-to-Real RL) uses simulators to reduce the need for extensive real-world interaction. In the few-shot off-dynamics setting, the goal is to learn a policy in a simulator whose dynamics differ from those of the real environment, and to transfer that policy effectively to the real world using only a handful of real-world transitions. In this setting, conventional RL agents tend to exploit simulator inaccuracies, yielding policies that excel in the simulator but underperform in the real environment. To address this challenge, we introduce a novel approach, inspired by recent advances in Imitation Learning and trust-region-based RL algorithms, that adds a penalty constraining the trajectories induced by the simulator-trained policy. We evaluate our method across a variety of environments representing diverse Sim-to-Real conditions in which access to the real environment is extremely limited, including high-dimensional systems relevant to real-world applications. In most of the tested scenarios, our method improves performance over existing baselines.