Learning from Demonstration (LfD) offers a promising paradigm for robot skill acquisition. Recent approaches attempt to extract manipulation commands directly from video demonstrations, yet face two critical challenges: (1) general video captioning models prioritize global scene features over task-relevant objects, producing descriptions unsuitable for precise robotic execution, and (2) end-to-end architectures coupling visual understanding with policy learning require extensive paired datasets and struggle to generalize across objects and scenarios. To address these limitations, we propose a novel ``Human-to-Robot'' imitation learning pipeline that enables robots to acquire manipulation skills directly from unstructured video demonstrations, inspired by the human ability to learn by watching and imitating. Our key innovation is a modular framework that decouples the learning process into two distinct stages: (1) Video Understanding, which combines Temporal Shift Modules (TSM) with Vision-Language Models (VLMs) to extract actions and identify interacted objects, and (2) Robot Imitation, which employs TD3-based deep reinforcement learning to execute the demonstrated manipulations. We validated our approach in PyBullet simulation environments with a UR5e manipulator and in a real-world experiment with a UF850 manipulator across four fundamental actions: reach, pick, move, and put. For video understanding, our method achieves 89.97% action classification accuracy and BLEU-4 scores of 0.351 on standard objects and 0.265 on novel objects, representing improvements of 76.4% and 128.4% over the best baseline, respectively. For robot manipulation, our framework achieves an average success rate of 87.5% across all actions, with 100% success on reaching tasks and up to 90% on complex pick-and-place operations. The project website is available at https://thanhnguyencanh.github.io/LfD4hri.