Estimating robot pose from RGB images is a crucial problem in computer vision and robotics. While previous methods have achieved promising performance, most of them presume full knowledge of the robot's internal state, e.g., ground-truth joint angles, which is not always available in real-world scenarios. On the other hand, existing approaches that estimate robot pose without joint-state priors suffer from a heavy computational burden and thus cannot support real-time applications. This work addresses the urgent need for efficient robot pose estimation with unknown states. We propose an end-to-end pipeline for real-time, holistic robot pose estimation from a single RGB image, even in the absence of known robot states. Our method decomposes the problem into estimating camera-to-robot rotation, robot state parameters, keypoint locations, and root depth, and designs a dedicated neural network module for each subtask. This approach allows for learning multi-facet representations and facilitates sim-to-real transfer through self-supervised learning. Notably, our method performs inference in a single feed-forward pass, eliminating the need for costly test-time iterative optimization. As a result, it delivers a 12x speedup with state-of-the-art accuracy, enabling real-time holistic robot pose estimation for the first time. Code is available at https://oliverbansk.github.io/Holistic-Robot-Pose/.