We present VGGT-SLAM 2.0, a real-time, feed-forward RGB SLAM system that substantially improves upon VGGT-SLAM for incrementally aligning submaps created from VGGT. First, we eliminate the high-dimensional 15-degree-of-freedom drift and planar degeneracy of VGGT-SLAM through a new factor graph design that still accounts for the reconstruction ambiguity of VGGT under unknown camera intrinsics. Second, by studying the attention layers of VGGT, we show that one of the layers is well suited to image retrieval verification, for free and without additional training, which enables both rejecting false-positive matches and completing more loop closures. Finally, we conduct a suite of experiments showing that VGGT-SLAM 2.0 can easily be adapted for open-set object detection and that it runs online in real time onboard a ground robot using a Jetson Thor. We test in environments ranging from cluttered indoor apartments and office scenes to a 4,200-square-foot barn, and we demonstrate that VGGT-SLAM 2.0 achieves the highest accuracy on the TUM dataset, with about 23 percent less pose error than VGGT-SLAM. Code will be released upon publication.