Accurate estimation of the relative pose between an object and a robot hand is critical for many manipulation tasks. However, most existing object-in-hand pose datasets use two-finger grippers and assume that the object remains fixed in the hand with no relative movement, which is not representative of real-world scenarios. To address this issue, we propose a 6D object-in-hand pose dataset collected by teleoperating an anthropomorphic Shadow Dexterous hand. The dataset comprises RGB-D images, proprioception, and tactile data, and covers diverse grasping poses, finger contact states, and object occlusions. To cope with the significant hand occlusion and limited tactile sensor contact found in real-world scenarios, we propose PoseFusion, a hybrid multi-modal fusion approach that integrates information from the visual and tactile perception channels. PoseFusion generates three candidate object poses from three estimators (tactile only, visual only, and visuo-tactile fusion); a SelectLSTM network then selects the best of the three, avoiding inferior fused poses that result from modality collapse. Extensive experiments demonstrate the robustness and advantages of our framework. All data and code are available on the project website: https://elevenjiang1.github.io/ObjectInHand-Dataset/
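The candidate-selection step can be illustrated with a minimal sketch. It is not the authors' implementation: the `Pose6D` type, the `select_pose` function, the quaternion convention, and the per-candidate scores are all hypothetical, and the scores merely stand in for whatever output the SelectLSTM network produces. The sketch only shows the idea of choosing one of three candidate poses rather than always trusting the fused estimate.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class Pose6D:
    """Hypothetical 6D pose container (not taken from the paper)."""
    position: Tuple[float, float, float]           # x, y, z (unit assumed: metres)
    quaternion: Tuple[float, float, float, float]  # (w, x, y, z), assumed convention


# The three estimators named in the abstract, in a fixed (assumed) order.
ESTIMATORS = ("tactile", "visual", "fusion")


def select_pose(candidates: Dict[str, Pose6D],
                scores: Sequence[float]) -> Tuple[str, Pose6D]:
    """Return the name and pose of the highest-scoring candidate.

    `scores` is a stand-in for a selector network's output: one value per
    estimator, in ESTIMATORS order, where higher means more trusted.
    """
    if len(scores) != len(ESTIMATORS):
        raise ValueError("expected one score per estimator")
    best_index = max(range(len(ESTIMATORS)), key=lambda i: scores[i])
    name = ESTIMATORS[best_index]
    return name, candidates[name]


# Usage with made-up numbers: the visual-only estimate scores highest,
# so it is chosen over the fused estimate.
identity = (1.0, 0.0, 0.0, 0.0)
candidates = {
    "tactile": Pose6D((0.10, 0.02, 0.30), identity),
    "visual": Pose6D((0.11, 0.02, 0.31), identity),
    "fusion": Pose6D((0.25, 0.10, 0.40), identity),
}
chosen_name, chosen_pose = select_pose(candidates, scores=[0.2, 0.7, 0.1])
```

Selecting a single candidate, rather than averaging the three, means a poor fused estimate cannot contaminate the final output when one modality is unreliable.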