Understanding geometric, semantic, and instance information in 3D scenes from sequential video data is essential for applications in robotics and augmented reality. However, existing Simultaneous Localization and Mapping (SLAM) methods generally focus on either geometric or semantic reconstruction. In this paper, we introduce PanoSLAM, the first SLAM system to integrate geometric reconstruction, 3D semantic segmentation, and 3D instance segmentation within a unified framework. Our approach builds upon 3D Gaussian Splatting, modified with several critical components to enable efficient rendering of depth, color, semantic, and instance information from arbitrary viewpoints. To achieve panoptic 3D scene reconstruction from sequential RGB-D videos, we propose an online Spatial-Temporal Lifting (STL) module that transfers 2D panoptic predictions from vision models into 3D Gaussian representations. The STL module addresses label noise and inconsistency in 2D predictions by refining the pseudo labels across multi-view inputs, creating a coherent 3D representation that enhances segmentation accuracy. Our experiments show that PanoSLAM outperforms recent semantic SLAM methods in both mapping and tracking accuracy. For the first time, it achieves panoptic 3D reconstruction of open-world environments directly from RGB-D video. (https://github.com/runnanchen/PanoSLAM)
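The core idea of lifting noisy 2D panoptic predictions into a consistent 3D representation can be illustrated with a minimal sketch. This is not the paper's actual STL module (which operates on 3D Gaussians); it is a simplified stand-in that back-projects per-view label maps into world space using depth and camera poses, then majority-votes labels per voxel across views to suppress inconsistent 2D predictions. All function names and parameters here are illustrative assumptions.

```python
import numpy as np
from collections import Counter, defaultdict

def backproject(depth, K, pose):
    """Lift every pixel to a 3D world point using its depth value,
    camera intrinsics K (3x3), and camera-to-world pose (4x4)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - K[0, 2]) / K[0, 0] * z
    y = (v - K[1, 2]) / K[1, 1] * z
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)  # (h, w, 4) homogeneous
    pts_world = pts_cam.reshape(-1, 4) @ pose.T              # (h*w, 4)
    return pts_world[:, :3]

def lift_labels(views, voxel=0.05):
    """Fuse per-view 2D label maps into 3D: hash world points into voxels
    and take a majority vote over all views observing each voxel.
    `views` is a list of (depth, labels, K, pose) tuples."""
    votes = defaultdict(Counter)
    for depth, labels, K, pose in views:
        pts = backproject(depth, K, pose)
        keys = np.floor(pts / voxel).astype(int)
        for key, lab in zip(map(tuple, keys), labels.ravel()):
            votes[key][int(lab)] += 1
    # The winning label per voxel is the multi-view-consistent pseudo label.
    return {key: c.most_common(1)[0][0] for key, c in votes.items()}
```

In this toy setting, a label that flickers in one view but is stable in the others is corrected by the cross-view vote, which is the intuition behind refining pseudo labels across multi-view inputs.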