Learning from Demonstration (LfD) offers a promising paradigm for robot skill acquisition. Recent approaches attempt to extract manipulation commands directly from video demonstrations, yet face two critical challenges: (1) general video captioning models prioritize global scene features over task-relevant objects, producing descriptions unsuitable for precise robotic execution, and (2) end-to-end architectures coupling visual understanding with policy learning require extensive paired datasets and struggle to generalize across objects and scenarios. To address these limitations, we propose a novel ``Human-to-Robot'' imitation learning pipeline that enables robots to acquire manipulation skills directly from unstructured video demonstrations, inspired by the human ability to learn by watching and imitating. Our key innovation is a modular framework that decouples the learning process into two distinct stages: (1) Video Understanding, which combines Temporal Shift Modules (TSM) with Vision-Language Models (VLMs) to extract actions and identify interacted objects, and (2) Robot Imitation, which employs TD3-based deep reinforcement learning to execute the demonstrated manipulations. We validated our approach in PyBullet simulation environments with a UR5e manipulator and in a real-world experiment with a UF850 manipulator across four fundamental actions: reach, pick, move, and put. For video understanding, our method achieves 89.97% action classification accuracy and BLEU-4 scores of 0.351 on standard objects and 0.265 on novel objects, representing improvements of 76.4% and 128.4% over the best baseline, respectively. For robot manipulation, our framework achieves an average success rate of 87.5% across all actions, with 100% success on reaching tasks and up to 90% on complex pick-and-place operations. The project website is available at https://thanhnguyencanh.github.io/LfD4hri.