Humans can learn to manipulate new objects by simply watching others; providing robots with the ability to learn from such demonstrations would enable a natural interface for specifying new behaviors. This work develops Robot See Robot Do (RSRD), a method for imitating articulated object manipulation from a single monocular RGB human demonstration, given a single static multi-view object scan. We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video with differentiable rendering. This analysis-by-synthesis approach uses part-centric feature fields in an iterative optimization that enables the use of geometric regularizers to recover 3D motions from only a single video. Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion. By representing demonstrations as part-centric trajectories, RSRD focuses on replicating the demonstration's intended behavior while considering the robot's own morphological limits, rather than attempting to reproduce the hand's motion. We evaluate 4D-DPM's 3D tracking accuracy against ground-truth-annotated 3D part trajectories, and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot. Each phase of RSRD achieves an average 87% success rate, for a total end-to-end success rate of 60% across 90 trials. Notably, this is accomplished using only feature fields distilled from large pretrained vision models -- without any task-specific training, fine-tuning, dataset collection, or annotation. Project page: https://robot-see-robot-do.github.io
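To make the analysis-by-synthesis idea concrete, the sketch below shows a deliberately simplified stand-in for the 4D-DPM objective: per-frame part poses (reduced here to 3D translations, with hand-derived analytic gradients instead of differentiable rendering of feature fields) are optimized so the "rendered" part matches the observation, with a temporal smoothness term standing in for the paper's geometric regularizers. All names and the loss structure are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def track_part_translations(part_points, observed_frames, lam=0.1,
                            lr=0.05, iters=300):
    """Toy analysis-by-synthesis tracker (illustrative sketch only).

    For each video frame, optimize a 3D translation so the 'rendered'
    part points match the observed points, plus a smoothness
    regularizer tying each frame's pose to the previous one -- a toy
    stand-in for 4D-DPM's differentiable-rendering optimization.
    """
    n_frames = len(observed_frames)
    poses = np.zeros((n_frames, 3))
    prev = np.zeros(3)  # pose of the previous frame (regularizer anchor)
    for f in range(n_frames):
        t = prev.copy()  # warm start from the previous frame's solution
        obs = observed_frames[f]
        for _ in range(iters):
            rendered = part_points + t  # trivial 'rendering' of the part
            # analytic gradient of mean squared point error w.r.t. t
            grad_data = 2.0 * (rendered - obs).mean(axis=0)
            # gradient of the temporal smoothness regularizer
            grad_reg = 2.0 * lam * (t - prev)
            t -= lr * (grad_data + grad_reg)
        poses[f] = t
        prev = t
    return poses
```

With the regularizer weight set to zero the recovered translations converge to the true per-frame motion; a nonzero `lam` trades data fit for temporal smoothness, which is the role geometric regularizers play in recovering plausible 3D motion from a single monocular video.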