Learning from human demonstrations has achieved remarkable results in robot manipulation. However, developing a robot system that matches human capabilities and data efficiency in learning and generalization remains a challenge, particularly in complex, unstructured real-world scenarios. We propose a system that processes RGBD videos to translate human actions into robot primitives and identifies task-relevant key poses of objects using Grounded Segment Anything. We then address the challenges robots face in replicating human actions, accounting for human-robot differences in kinematics and collision geometry. To test the effectiveness of our system, we conducted experiments on manual dishwashing. With a single human demonstration recorded in a mockup kitchen, the system achieved a 50-100% success rate for each step and up to a 40% success rate for the whole task with different objects in a home kitchen. Videos are available at https://robot-dishwashing.github.io