Humans effortlessly integrate common-sense knowledge with sensory input from vision and touch to understand their surroundings. Emulating this capability, we introduce FusionSense, a novel 3D reconstruction framework that enables robots to fuse priors from foundation models with highly sparse observations from vision and tactile sensors. FusionSense addresses three key challenges: (i) How can robots efficiently acquire robust global shape information about the surrounding scene and objects? (ii) How can robots strategically select touch points on the object using geometric and common-sense priors? (iii) How can partial observations such as tactile signals improve the overall representation of the object? Our framework employs 3D Gaussian Splatting as a core representation and incorporates a hierarchical optimization strategy involving global structure construction, object visual hull pruning, and local geometric constraints. This advancement yields fast and robust perception in environments containing traditionally challenging objects that are transparent, reflective, or dark, enabling a wider range of downstream manipulation and navigation tasks. Experiments on real-world data suggest that our framework outperforms previous state-of-the-art sparse-view methods. All code and data are open-sourced on the project website.