To realize human-robot collaboration, robots need to execute actions for new tasks according to human instructions, given only finite prior knowledge. Human experts can share their knowledge of how to perform a task with a robot through multi-modal instructions in their demonstrations, showing a sequence of short-horizon steps that achieve a long-horizon goal. This paper introduces a method for generating robot action sequences from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech into a sequence of robot actions, represented as dynamic movement primitives (DMPs), and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data. We built a system that accomplishes various cooking actions, in which a robot arm executes a DMP sequence acquired from a cooking video using the audio-visual Transformer. Experiments with Epic-Kitchen-100, YouCookII, QuerYD, and in-house instruction video datasets show that the proposed method improves the quality of DMP sequences, reaching 2.3 times the METEOR score obtained with a baseline video-to-action Transformer. The model achieved a task success rate of 32% when given task knowledge of the object.