Representing the environment is a central challenge in robotics and is essential for effective decision-making. Traditionally, before capturing images with a manipulator-mounted camera, users need to calibrate the camera using a specific external marker, such as a checkerboard or AprilTag. However, recent advances in computer vision have led to the development of \emph{3D foundation models}. These are large, pre-trained neural networks that can quickly establish accurate multi-view correspondences from very few images, even in the absence of rich visual features. This paper advocates for the integration of 3D foundation models into scene representation approaches for robotic systems equipped with manipulator-mounted RGB cameras. Specifically, we propose the Joint Calibration and Representation (JCR) method. JCR uses RGB images captured by a manipulator-mounted camera to simultaneously construct an environment representation and calibrate the camera relative to the robot's end-effector, without requiring dedicated calibration markers. The resulting 3D environment representation is aligned with the robot's coordinate frame and preserves physically accurate scale. We demonstrate that JCR can build effective scene representations using a low-cost RGB camera attached to a manipulator, without prior calibration.