In this paper, we address the problem of using visuo-tactile feedback for 6-DoF localization and 3D reconstruction of unknown in-hand objects. We propose FingerSLAM, a closed-loop factor graph-based pose estimator that combines local tactile sensing at the fingertip with global vision sensing from a wrist-mounted camera. FingerSLAM is built from two constituent pose estimators: a multi-pass refined tactile-based pose estimator that captures movements from detailed local textures, and a single-pass vision-based pose estimator that predicts from a global view of the object. We also design a loop closure mechanism that actively matches the current vision and tactile images against previously stored key-frames to reduce accumulated error. FingerSLAM integrates the tactile and vision modalities, together with the loop closure mechanism, in a factor graph-based optimization framework. This framework produces optimized pose estimates that are more accurate than those of the standalone estimators. The estimated poses are then used to reconstruct the shape of the unknown object incrementally by stitching together the local point clouds recovered from tactile images. We train our system on real-world data collected with 20 objects. We demonstrate reliable visuo-tactile pose estimation and shape reconstruction through quantitative and qualitative real-world evaluations on 6 objects unseen during training.
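
To illustrate how a factor graph can fuse two relative-pose estimators with loop closures, the following is a minimal sketch and not the FingerSLAM implementation. It assumes a heavily simplified, translation-only setting in which every factor is linear, so the graph reduces to one weighted least-squares solve; the actual system estimates full 6-DoF poses. The function name `fuse_translations`, the factor tuple layout, and all noise values are hypothetical choices made for this example.

```python
import numpy as np

def fuse_translations(n_poses, factors, prior_sigma=1e-3):
    """Solve a translation-only pose graph by weighted linear least squares.

    factors: list of (i, j, delta, sigma), each stating x_j - x_i ~ delta
             with standard deviation sigma (assumed isotropic noise).
    Returns an (n_poses, 3) array of positions, with x_0 anchored at the origin.
    """
    rows, rhs = [], []

    # Prior factor: pins the first pose to the origin to remove gauge freedom.
    prior = np.zeros(n_poses)
    prior[0] = 1.0 / prior_sigma
    rows.append(prior)
    rhs.append(np.zeros(3))

    # Relative factors: tactile odometry, vision odometry, and loop closures
    # all share the same form and differ only in indices and noise level.
    for i, j, delta, sigma in factors:
        row = np.zeros(n_poses)
        row[i], row[j] = -1.0 / sigma, 1.0 / sigma
        rows.append(row)
        rhs.append(np.asarray(delta, dtype=float) / sigma)

    A = np.vstack(rows)   # shape (n_factors + 1, n_poses)
    b = np.vstack(rhs)    # shape (n_factors + 1, 3)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

# Hypothetical example: three poses, two odometry sources, one loop closure.
example_factors = [
    (0, 1, [1.0, 0.0, 0.0], 0.05),   # tactile estimate of step 0 -> 1
    (0, 1, [1.2, 0.0, 0.0], 0.20),   # vision estimate of step 0 -> 1
    (1, 2, [-0.9, 0.0, 0.0], 0.05),  # tactile estimate of step 1 -> 2
    (2, 0, [0.0, 0.0, 0.0], 0.02),   # loop closure: pose 2 matches key-frame 0
]
fused = fuse_translations(3, example_factors)
```

Each factor is weighted by the inverse of its assumed standard deviation, so the solution leans toward the lower-noise measurement, and the loop-closure factor pulls the end of the trajectory back toward the matched key-frame. A 6-DoF version would replace the linear solve with nonlinear optimization over rigid-body poses.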