We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the lack of geometric consistency in previous flow-based approaches to achieve accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe poses and depths under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments demonstrate that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets while running in real time at 18 FPS, showing strong generalization to diverse scenarios and the practical applicability of our method.
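To make the kind of multi-view constraint described above concrete, the following is a minimal, hypothetical sketch of a reliability-weighted joint pose/depth refinement driven by flow correspondences. It is not the paper's Bi-Consistent Bundle Adjustment Layer: the gradient-descent solver, all function and tensor names (`rodrigues`, `reproject`, `weighted_ba`, `reliability`, the synthetic inputs), and the intrinsics are assumptions made purely for illustration of how flow-predicted matches, per-match confidence, and depth priors can be coupled in one objective.

```python
# Hypothetical sketch (not the paper's implementation): refine a relative pose and
# per-pixel inverse depth by minimising a reliability-weighted reprojection error
# against flow-predicted correspondences. All inputs below are synthetic.
import torch


def rodrigues(axis_angle):
    """Convert axis-angle vectors (B, 3) to rotation matrices (B, 3, 3)."""
    theta = axis_angle.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = axis_angle / theta                                  # unit rotation axes
    S = torch.zeros(axis_angle.shape[0], 3, 3)              # skew-symmetric matrices
    S[:, 0, 1], S[:, 0, 2] = -k[:, 2], k[:, 1]
    S[:, 1, 0], S[:, 1, 2] = k[:, 2], -k[:, 0]
    S[:, 2, 0], S[:, 2, 1] = -k[:, 1], k[:, 0]
    theta = theta.unsqueeze(-1)
    I = torch.eye(3).expand_as(S)
    return I + torch.sin(theta) * S + (1 - torch.cos(theta)) * (S @ S)


def reproject(uv, inv_depth, K, R_ij, t_ij):
    """Warp pixels uv (N, 2) from frame i into frame j under (R_ij, t_ij)."""
    ones = torch.ones(uv.shape[0], 1)
    rays = (torch.inverse(K) @ torch.cat([uv, ones], dim=-1).T).T  # back-projected rays
    pts_i = rays / inv_depth.unsqueeze(-1)                         # 3D points in frame i
    pts_j = (R_ij @ pts_i.T).T + t_ij                              # points in frame j
    proj = (K @ pts_j.T).T
    return proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)


def weighted_ba(uv_i, target_uv_j, reliability, K, steps=200):
    """Jointly refine relative pose and inverse depth; uncertain matches are down-weighted."""
    axis_angle = torch.zeros(1, 3, requires_grad=True)
    t = torch.zeros(3, requires_grad=True)
    log_inv_depth = torch.full((uv_i.shape[0],), -0.7, requires_grad=True)  # keeps depth positive
    opt = torch.optim.Adam([axis_angle, t, log_inv_depth], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        R = rodrigues(axis_angle)[0]
        pred = reproject(uv_i, log_inv_depth.exp(), K, R, t)
        residual = (pred - target_uv_j).norm(dim=-1)               # reprojection error (pixels)
        loss = (reliability * residual).mean()                     # reliability-weighted objective
        loss.backward()
        opt.step()
    return axis_angle.detach(), t.detach(), log_inv_depth.exp().detach()


if __name__ == "__main__":
    torch.manual_seed(0)
    K = torch.tensor([[300., 0., 160.], [0., 300., 120.], [0., 0., 1.]])
    uv_i = torch.rand(64, 2) * torch.tensor([320., 240.])          # sampled pixels in frame i
    target_uv_j = uv_i + torch.randn(64, 2)                        # synthetic flow-predicted matches
    reliability = torch.rand(64)                                   # per-match confidence in [0, 1]
    _, t_refined, _ = weighted_ba(uv_i, target_uv_j, reliability, K)
    print("refined translation:", t_refined)
```

In the actual system the optimization is described as a differentiable layer inside the network and the reliability weights feed back into the flow update, closing the matching-optimization loop; the sketch above only illustrates the weighted multi-view objective with an off-the-shelf gradient solver.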