We introduce VIOLA, an object-centric imitation learning approach to learning closed-loop visuomotor policies for robot manipulation. Our approach constructs object-centric representations based on general object proposals from a pre-trained vision model. VIOLA uses a transformer-based policy to reason over these representations and attend to the task-relevant visual factors for action prediction. Such object-based structural priors improve the robustness of deep imitation learning algorithms to object variations and environmental perturbations. We quantitatively evaluate VIOLA in simulation and on real robots. VIOLA outperforms state-of-the-art imitation learning methods by $45.8\%$ in success rate. It has also been deployed successfully on a physical robot to solve challenging long-horizon tasks, such as dining table arrangement and coffee making. More videos and model details can be found in the supplementary material and on the project website: https://ut-austin-rpl.github.io/VIOLA .