While reinforcement learning has been successfully applied to a range of robotic control problems in complex, uncertain environments, its reliance on extensive training data, typically sourced from simulation, limits real-world deployment due to the domain gap between simulated and physical systems and the limited availability of real-world samples. We propose a novel method for sim-to-real transfer of reinforcement learning policies that reinterprets neural style transfer from image processing to synthesise novel training data from unpaired, unlabelled real-world datasets. We employ a variational autoencoder to jointly learn self-supervised feature representations for style transfer and to generate weakly paired source-target trajectories, improving the physical realism of the synthesised trajectories. We demonstrate our approach on a case study of robot cutting of unknown materials. Compared to baseline methods, including our previous work, CycleGAN, and conditional variational autoencoder-based time-series translation, our approach achieves improved task completion time and behavioural stability with minimal real-world data. Our framework is robust to geometric and material variation and demonstrates the feasibility of policy adaptation in challenging contact-rich tasks where real-world reward information is unavailable.