Guided Imitation of Task and Motion Planning

From arXiv. 16 pages, 6 figures, 2 tables. Submitted to the Conference on Robot Learning 2021; to be published in Proceedings of Machine Learning Research.

While modern policy optimization methods can perform complex manipulation from sensory data, they struggle on problems with extended time horizons and multiple sub-goals. Task and motion planning (TAMP) methods, on the other hand, scale to long horizons, but they are computationally expensive and need to precisely track world state. We propose a method that draws on the strengths of both approaches: we train a policy to imitate a TAMP solver's output. This produces a feed-forward policy that can accomplish multi-step tasks from sensory data. First, we build an asynchronous distributed TAMP solver that can produce supervision data fast enough for imitation learning. Then, we propose a hierarchical policy architecture that lets us use partially trained control policies to speed up the TAMP solver. In robotic manipulation tasks with 7-DoF joint control, the partially trained policies reduce the time needed for planning by a factor of up to 2.6. Among these tasks, we can learn a policy that solves the RoboSuite 4-object pick-place task 88% of the time from object pose observations, and a policy that solves the RoboDesk 9-goal benchmark 79% of the time from RGB images (averaged across the 9 disparate tasks).
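
The core idea of training a feed-forward policy to imitate a planner's output can be illustrated on a toy problem. The sketch below is not the paper's solver or architecture: the 1-D point, the proportional `plan` function standing in for a TAMP solver, the linear policy, and all names and parameters are hypothetical and chosen only to keep the example self-contained. It shows a planner producing (observation, action) supervision for multi-sub-goal tasks and a policy fitted to that data by plain behavior cloning.

```python
import random

def plan(x, goals, gain=0.5, steps_per_goal=8):
    """Toy stand-in for a planner (hypothetical, not a real TAMP solver).

    Drives a 1-D point through each sub-goal in order and returns the
    (observation, action) pairs it produced along the way.
    """
    demo = []
    for g in goals:
        for _ in range(steps_per_goal):
            a = gain * (g - x)        # expert action toward the active sub-goal
            demo.append(((x, g), a))  # observation = (position, active sub-goal)
            x += a
    return demo

def collect(num_tasks, rng):
    """Query the planner on random 3-sub-goal tasks to build a dataset."""
    data = []
    for _ in range(num_tasks):
        start = rng.uniform(-1, 1)
        goals = [rng.uniform(-1, 1) for _ in range(3)]
        data.extend(plan(start, goals))
    return data

def fit_policy(data, rng, epochs=50, lr=0.1):
    """Behavior cloning: fit a = w_x*x + w_g*g + b by SGD on squared error."""
    w_x = w_g = b = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for (x, g), a in data:
            err = (w_x * x + w_g * g + b) - a
            w_x -= lr * err * x
            w_g -= lr * err * g
            b -= lr * err
    return w_x, w_g, b

def rollout(policy, x, goals, steps_per_goal=8):
    """Run the learned policy alone (no planner) through the sub-goals."""
    w_x, w_g, b = policy
    for g in goals:
        for _ in range(steps_per_goal):
            x += w_x * x + w_g * g + b
    return x

rng = random.Random(0)
policy = fit_policy(collect(20, rng), rng)
final = rollout(policy, 0.0, [0.8, -0.5, 0.3])
print(policy, final)
```

In this toy setting the planner's behavior is exactly linear in the observation, so the cloned policy recovers it and ends close to the last sub-goal without calling the planner at run time. The asynchronous distributed data generation and the hierarchical architecture described above are not modeled here.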
