A robot's ability to anticipate the 3D action target location of a hand's movement from egocentric videos can greatly improve safety and efficiency in human-robot interaction (HRI). While previous research has predominantly focused on semantic action classification or 2D target region prediction, we argue that predicting the action target's 3D coordinate could pave the way for more versatile downstream robotics tasks, especially given the increasing prevalence of headset devices. This study expands EgoPAT3D, the sole dataset dedicated to egocentric 3D action target prediction. We augment both its size and diversity, enhancing its potential for generalization. Moreover, we substantially enhance the baseline algorithm by introducing a large pre-trained model and human prior knowledge. Remarkably, our novel algorithm now achieves superior prediction outcomes using only RGB images, eliminating the previous need for 3D point clouds and IMU input. Furthermore, we deploy our enhanced baseline algorithm on a real-world robotic platform to illustrate its practical utility in straightforward HRI tasks. The demonstrations showcase the real-world applicability of our advancements and may inspire more HRI use cases involving egocentric vision. All code and data are open-sourced and can be found on the project website.