RGBManip: Monocular Image-based Robotic Manipulation through Active Object Pose Estimation

Robotic manipulation requires accurate perception of the environment, which is a significant challenge because environments are complex and constantly changing. RGB images and point clouds are the two most commonly used observation modalities in vision-based robotic manipulation, but each has its own limitations. Point clouds from commercial sensors often suffer from sparse sampling and noisy output due to the limits of the emission-reception imaging principle. RGB images, while rich in texture information, lack the depth and 3D information that robotic manipulation requires. To mitigate these challenges, we propose an image-only robotic manipulation framework that leverages an eye-on-hand monocular camera installed on the robot's parallel gripper. By moving with the gripper, the camera can actively perceive the object from multiple perspectives during the manipulation process. This enables estimation of the object's 6D pose, which can then be used for manipulation. While obtaining images from more diverse viewpoints typically improves pose estimation, it also increases manipulation time. To address this trade-off, we employ a reinforcement learning policy that synchronizes the manipulation strategy with active perception, balancing 6D pose accuracy against manipulation efficiency. Our experimental results in both simulated and real-world environments demonstrate the state-of-the-art effectiveness of our approach. We believe that our method will inspire further research on real-world-oriented robotic manipulation.
