Imitation Learning (IL) enables robots to learn complex skills from demonstrations without explicit task modeling, but it typically requires large numbers of demonstrations, incurring significant collection costs. Prior work has investigated flow as an intermediate representation that allows human videos to substitute for robot data, thereby reducing the number of robot demonstrations required. However, most prior work restricts the flow to the object or to specific points on the robot/hand, and such representations cannot describe the motion of the interaction. Moreover, relying on flow alone to generalize to scenarios observed only in human videos remains limited, as flow cannot capture precise motion details. Furthermore, conditioning on scene observations to produce precise actions may cause a flow-conditioned policy to overfit to the training tasks and weaken the generalization indicated by the flow. To address these gaps, we propose SFCrP, which comprises a Scene Flow prediction model for Cross-embodiment learning (SFCr) and a Flow and Cropped point cloud conditioned Policy (FCrP). SFCr learns from both robot and human videos and predicts the trajectory of any point. FCrP follows the general motion given by the flow and adjusts the action based on observations for precision tasks. Our method outperforms state-of-the-art baselines across various real-world task settings, while also exhibiting strong spatial and instance generalization to scenarios seen only in human videos.