Offline reinforcement learning (RL) offers a promising way to avoid costly online interaction with the real environment. However, its performance depends heavily on the quality of the dataset, and a poor dataset may cause extrapolation error during learning. In many robotic applications, an inaccurate simulator is often available. Yet data collected from such a simulator cannot be used directly in offline RL, because of the well-known exploration-exploitation dilemma and the dynamics gap between the inaccurate simulator and the real environment. To address these issues, we propose a novel approach that combines the offline dataset with data from the inaccurate simulator. Specifically, we pre-train a generative adversarial network (GAN) to fit the state distribution of the offline dataset. We then collect data from the inaccurate simulator, starting from states sampled from the generator, and reweight the simulated data using the discriminator. Experiments on the D4RL benchmark and a real-world manipulation task show that our method makes better use of both the inaccurate simulator and limited offline datasets, and outperforms state-of-the-art methods.
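The pipeline described above (sample start states from the generator, roll out in the inaccurate simulator, reweight with the discriminator) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy stand-ins for the generator, simulator, policy, and discriminator, and in particular the weighting rule. The sketch uses the standard GAN density-ratio estimate D / (1 - D), clipped and normalised to mean 1; the abstract does not state the exact weighting the authors use, so this should not be read as their formula.

```python
import numpy as np


def discriminator_weights(d_scores, max_weight=10.0, eps=1e-6):
    """Map discriminator outputs D(s) in (0, 1) to per-sample weights.

    Assumption: density-ratio estimate D / (1 - D), clipped at `max_weight`
    and normalised so the weights average to 1.
    """
    d = np.clip(np.asarray(d_scores, dtype=np.float64), eps, 1.0 - eps)
    ratio = np.minimum(d / (1.0 - d), max_weight)
    return ratio / ratio.mean()


def collect_simulated_transitions(sample_start_states, sim_step, policy,
                                  n_starts, horizon):
    """Roll out `policy` in the simulator from generator-sampled start states."""
    transitions = []
    for state in sample_start_states(n_starts):
        for _ in range(horizon):
            action = policy(state)
            next_state, reward = sim_step(state, action)
            transitions.append((state, action, reward, next_state))
            state = next_state
    return transitions


# Hypothetical stand-ins for the trained generator, simulator, and policy.
rng = np.random.default_rng(0)
sample_start_states = lambda n: rng.normal(size=(n, 2))
sim_step = lambda s, a: (s + 0.1 * a, -float(np.sum(s ** 2)))
policy = lambda s: -s

transitions = collect_simulated_transitions(
    sample_start_states, sim_step, policy, n_starts=4, horizon=5)

# Hypothetical stand-in discriminator: scores states near the origin higher.
states = np.stack([t[0] for t in transitions])
d_scores = 1.0 / (1.0 + np.exp(np.linalg.norm(states, axis=1) - 1.0))
weights = discriminator_weights(d_scores)
```

In a sketch like this, simulated states that the discriminator judges close to the offline state distribution receive larger weights. The clipping bound is a design choice that keeps a few high-ratio samples from dominating the weighted loss.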