We consider the Imitation Learning (IL) setup in which expert data are collected not in the actual deployment environment but in a different version of it. To address the resulting distribution shift, we combine behavior cloning (BC) with a planner tasked with bringing the agent back to states visited by the expert whenever the agent deviates from the demonstration distribution. The resulting algorithm, POIR, can be trained offline and can leverage online interactions to efficiently fine-tune its planner and improve performance over time. We test POIR on a variety of human-generated manipulation demonstrations in a realistic robotic manipulation simulator and show that the learned policy is robust to different initial state distributions and to noisy dynamics.
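The idea of pairing BC with a planner that steers back toward expert-visited states can be illustrated with a small sketch. This is not the POIR implementation: the function names, the Euclidean nearest-neighbor distance, the fixed deviation threshold, the one-step lookahead, and the toy dynamics below are all assumptions made purely for illustration.

```python
import numpy as np

def nearest_expert_distance(state, expert_states):
    """Euclidean distance from `state` to the closest expert-visited state."""
    return float(np.min(np.linalg.norm(expert_states - state, axis=1)))

def select_action(state, expert_states, bc_policy, dynamics,
                  candidate_actions, threshold):
    """Hypothetical rule: follow BC near the expert data, otherwise plan back."""
    if nearest_expert_distance(state, expert_states) <= threshold:
        return bc_policy(state)
    # Assumed one-step lookahead: choose the candidate action whose predicted
    # next state lies closest to any expert-visited state.
    costs = [nearest_expert_distance(dynamics(state, a), expert_states)
             for a in candidate_actions]
    return candidate_actions[int(np.argmin(costs))]

# Toy setup (all values are illustrative assumptions).
expert_states = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
candidate_actions = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
bc_policy = lambda s: np.array([1.0, 0.0])   # stand-in for a cloned policy
dynamics = lambda s, a: s + a                # stand-in for a learned model

near = select_action(np.array([1.0, 0.2]), expert_states, bc_policy,
                     dynamics, candidate_actions, threshold=0.5)
far = select_action(np.array([1.0, 2.0]), expert_states, bc_policy,
                    dynamics, candidate_actions, threshold=0.5)
```

In this toy setup, the state close to the demonstrations receives the BC action, while the state far from them receives the action that moves it back toward the expert states. A learned planner and a learned deviation criterion would replace the hand-written stand-ins used here.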