Visual SLAM algorithms have improved significantly by adopting 3D Gaussian Splatting (3DGS) representations, particularly in generating high-fidelity dense maps. However, they rely on a static-environment assumption, and their performance degrades substantially in dynamic environments. This paper presents GGD-SLAM, a framework that employs a generalizable motion model to address localization and dense mapping in dynamic environments without requiring predefined semantic annotations or depth input. Specifically, the proposed system uses a First-In-First-Out (FIFO) queue to manage incoming frames, which enables dynamic semantic feature extraction through a sequential attention mechanism. This mechanism is combined with a dynamic feature enhancer to separate static and dynamic components. Additionally, to minimize the impact of dynamic distractors on the static components, we devise a method that fills occluded areas by sampling static information, and we design a distractor-adaptive Structural Similarity Index Measure (SSIM) loss tailored to dynamic environments, which significantly improves the system's robustness. Experiments on real-world dynamic datasets demonstrate that the proposed system achieves state-of-the-art camera pose estimation and dense reconstruction in dynamic scenes.
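To make two of the ingredients named above concrete, the following is a minimal illustrative sketch, not the GGD-SLAM implementation. It assumes (hypothetically) that frames are identified by integer ids, that each pixel carries a static weight in [0, 1] (1 for static, 0 for a dynamic distractor), and that SSIM is computed over a single global window rather than the sliding windows used in practice. The names `FrameQueue` and `masked_ssim_loss` are invented for illustration.

```python
from collections import deque
from typing import Deque, List, Sequence


class FrameQueue:
    """Fixed-capacity FIFO buffer: pushing past capacity evicts the oldest frame."""

    def __init__(self, capacity: int) -> None:
        self._frames: Deque[int] = deque(maxlen=capacity)

    def push(self, frame_id: int) -> None:
        self._frames.append(frame_id)

    def window(self) -> List[int]:
        # Oldest frame first, newest frame last.
        return list(self._frames)


def masked_ssim_loss(
    rendered: Sequence[float],
    observed: Sequence[float],
    static_weight: Sequence[float],
    c1: float = 0.01 ** 2,
    c2: float = 0.03 ** 2,
) -> float:
    """Return 1 - SSIM using weighted statistics, so that pixels with
    weight 0 (assumed dynamic distractors) do not influence the loss.

    Assumption: intensities lie in [0, 1], and one global window is used.
    """
    total = sum(static_weight)
    if total <= 0.0:
        return 0.0  # no static pixels: nothing to supervise

    def wmean(values: Sequence[float]) -> float:
        return sum(w * v for w, v in zip(static_weight, values)) / total

    mx, my = wmean(rendered), wmean(observed)
    vx = wmean([(x - mx) * (x - mx) for x in rendered])
    vy = wmean([(y - my) * (y - my) for y in observed])
    cov = wmean([(x - mx) * (y - my) for x, y in zip(rendered, observed)])

    ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2)
    )
    return 1.0 - ssim
```

In this sketch, a rendered image that disagrees with the observation only at zero-weight pixels incurs no loss, which is the intended effect of down-weighting distractors; how the weights are actually derived from the motion model is not specified here.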