Despite recent progress in calibration-free monocular SLAM via 3D vision foundation models, scale drift remains severe on long sequences. Motion-agnostic partitioning breaks contextual coherence and causes zero-motion drift, while conventional geometric alignment is computationally expensive. To address these issues, we propose VGGT-Motion, a calibration-free SLAM system for efficient and robust global consistency over kilometer-scale trajectories. Specifically, we first propose a motion-aware submap construction mechanism that uses optical flow to guide adaptive partitioning, prune static redundancy, and encapsulate turns for stable local geometry. We then design an anchor-driven direct Sim(3) registration strategy. By exploiting context-balanced anchors, it achieves search-free, pixel-wise dense alignment and efficient loop closure without costly feature matching. Finally, a lightweight submap-level pose graph optimization enforces global consistency with linear complexity, enabling scalable long-range operation. Experiments show that VGGT-Motion markedly improves trajectory accuracy and efficiency, achieving state-of-the-art performance in zero-shot, long-range calibration-free monocular SLAM.
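The motion-aware submap construction described above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it assumes a per-frame mean optical-flow magnitude is already available, and the `static_thresh` and `submap_budget` values are placeholder assumptions. Near-static frames are pruned, and a submap is closed once enough motion has accumulated.

```python
import numpy as np

def partition_submaps(flow_mags, static_thresh=0.5, submap_budget=50.0):
    """Sketch of motion-aware submap partitioning.

    flow_mags: per-frame mean optical-flow magnitude (pixels).
    Frames below static_thresh are dropped as static redundancy;
    a new submap starts once accumulated motion exceeds submap_budget.
    Both thresholds are illustrative assumptions, not the paper's values.
    """
    submaps, current, accum = [], [], 0.0
    for i, mag in enumerate(flow_mags):
        if mag < static_thresh:       # prune near-zero-motion frames
            continue
        current.append(i)
        accum += mag
        if accum >= submap_budget:    # enough motion: close this submap
            submaps.append(current)
            current, accum = [], 0.0
    if current:                       # flush the trailing partial submap
        submaps.append(current)
    return submaps
```

On a sequence that idles for five frames and then moves steadily, the idle frames are skipped entirely and the moving frames are split by accumulated motion rather than by a fixed frame count, which is the behavior the abstract attributes to motion-aware (as opposed to motion-agnostic) partitioning.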
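For the direct Sim(3) registration step, a standard closed-form building block is the Umeyama alignment, which recovers scale, rotation, and translation between two corresponding point sets. The sketch below shows this classical solver applied to hypothetical anchor points from two submaps; it illustrates the kind of dense, search-free similarity alignment the abstract describes, not the system's actual registration pipeline.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Closed-form Sim(3) alignment (Umeyama, 1991).

    src, dst: (N, 3) arrays of corresponding anchor points.
    Returns (s, R, t) such that dst ~= s * R @ src + t.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)            # cross-covariance of centered sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                    # avoid reflections
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)    # variance of the source set
    s = np.trace(np.diag(D) @ S) / var_s  # optimal uniform scale
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Because the solution is closed-form, registering two submaps through a fixed set of anchors costs a single SVD rather than an iterative feature-matching search, which is consistent with the efficiency claim in the abstract.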