Can we learn robot manipulation for everyday tasks solely by watching videos of humans performing arbitrary tasks in diverse, unstructured settings? Unlike widely adopted strategies that learn task-specific behaviors or directly imitate a human video, we develop a framework for extracting agent-agnostic action representations from human videos and then mapping them to the agent's embodiment during deployment. Our framework is based on predicting plausible human hand trajectories given an initial image of a scene. After training this prediction model on a diverse set of human videos from the internet, we deploy the trained model zero-shot for physical robot manipulation tasks, after appropriate transformations to the robot's embodiment. This simple strategy lets us solve coarse manipulation tasks such as opening and closing drawers, pushing, and tool use, without access to any in-domain robot manipulation trajectories. Our real-world deployment results establish a strong baseline for the action-prediction information that can be acquired from diverse, arbitrary videos of human activities and used for zero-shot robotic manipulation in unseen scenes.