Humans have the remarkable ability to use held objects as tools to interact with their environment. To do so, humans internally estimate how hand movements affect the object's movement. We wish to endow robots with this capability. We contribute a method to jointly estimate the geometry and pose of objects grasped by a robot from RGB images captured by an external camera. Notably, our method transforms the estimated geometry into the robot's coordinate frame without requiring the extrinsic parameters of the external camera to be calibrated. Our approach leverages 3D foundation models (large models pre-trained on huge datasets for 3D vision tasks) to produce initial estimates of the in-hand object. These initial estimates lack physically correct scale and are expressed in the camera's frame. We then formulate, and efficiently solve, a coordinate-alignment problem to recover accurate scale, along with a transformation of the objects into the robot's coordinate frame. Forward kinematics mappings can subsequently be defined from the manipulator's joint angles to specified points on the object. These mappings make it possible to estimate points on the held object at arbitrary configurations, allowing robot motion to be designed with respect to coordinates on the grasped object. We empirically evaluate our approach on a robot manipulator holding a diverse set of real-world objects.
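The abstract does not spell out the form of the coordinate-alignment problem, but a standard building block for recovering a scale together with a rigid transform between corresponding point sets is the Umeyama least-squares similarity alignment. The sketch below is a minimal, generic implementation of that alignment (not necessarily the paper's exact formulation): given points in the camera frame and corresponding points whose robot-frame coordinates are known, it estimates the scale `s`, rotation `R`, and translation `t` mapping one frame into the other.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Estimate a similarity transform (scale s, rotation R, translation t)
    such that dst ≈ s * R @ src_i + t in the least-squares sense.
    src, dst: (N, 3) arrays of corresponding 3D points
    (e.g. camera-frame estimates vs. robot-frame coordinates)."""
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst
    # Cross-covariance between the centered point sets
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    # Guard against a reflection solution
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src   # recovered metric scale
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Once `s`, `R`, and `t` are recovered, the scale-free camera-frame reconstruction can be rescaled and rigidly transported into the robot's coordinate frame in one step.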
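The forward kinematics mapping described above can be sketched generically: once the object geometry is expressed in the robot's frame, a point on the object is constant in the end-effector frame (while the grasp is rigid), so its base-frame position at any joint configuration is obtained by composing it with the arm's forward kinematics. The helper below is a hypothetical illustration; `fk_base_ee` stands in for whatever FK routine the manipulator provides.

```python
import numpy as np

def point_on_object_in_base(fk_base_ee, q, p_obj_ee):
    """Map a point fixed on the grasped object, expressed in the
    end-effector frame, into the robot base frame at joint angles q.
    fk_base_ee: callable q -> 4x4 homogeneous transform (base <- end-effector)
    p_obj_ee:   (3,) point on the object in the end-effector frame,
                constant as long as the grasp does not slip."""
    T = fk_base_ee(q)                      # base <- end-effector at config q
    return T[:3, :3] @ p_obj_ee + T[:3, 3]
```

Evaluating this mapping at arbitrary `q` is what allows motion to be planned directly in terms of coordinates on the held object (e.g. a tool tip) rather than the gripper.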