Monocular visual SLAM enables 3D reconstruction from internet video and autonomous navigation on resource-constrained platforms, yet it suffers from scale drift, i.e., the gradual divergence of the estimated scale over long sequences. Existing frame-to-frame methods achieve real-time performance through local optimization but accumulate scale drift because independent windows share no global constraints. To address this, we propose SCE-SLAM, an end-to-end SLAM system that maintains scale consistency through scene coordinate embeddings: learned patch-level representations encoding 3D geometric relationships under a canonical scale reference. The framework consists of two key modules: geometry-guided aggregation, which leverages 3D spatial proximity to propagate scale information from historical observations through geometry-modulated attention, and scene coordinate bundle adjustment, which anchors current estimates to the reference scale through explicit 3D coordinate constraints decoded from the scene coordinate embeddings. Experiments on KITTI, Waymo, and vKITTI demonstrate substantial improvements: our method reduces absolute trajectory error by 8.36 m on KITTI compared to the best prior approach, while maintaining 36 FPS and achieving scale consistency across large-scale scenes.
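The geometry-modulated attention described above can be illustrated with a minimal sketch: feature-similarity logits between current and historical patches are penalized by the 3D distance of their decoded coordinates, so spatially near observations dominate the aggregation. The penalty form, the `gamma` parameter, and the function name are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def geometry_modulated_attention(queries, keys, values, points_q, points_k, gamma=1.0):
    """Attention whose logits are down-weighted by 3D distance between patches.

    Hypothetical sketch (not SCE-SLAM's exact module):
      queries: (Nq, d) current-frame patch features
      keys, values: (Nk, d) historical patch features carrying scale information
      points_q, points_k: (Nq, 3), (Nk, 3) 3D patch coordinates in the canonical scale
      gamma: strength of the geometric penalty
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)  # (Nq, Nk) feature similarity
    # Pairwise Euclidean distances between current and historical 3D points.
    dist = np.linalg.norm(points_q[:, None, :] - points_k[None, :, :], axis=-1)
    logits = logits - gamma * dist          # geometry modulation: penalize distant patches
    # Numerically stable softmax over historical patches.
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ values                       # (Nq, d) aggregated scale-bearing features
```

With a large `gamma`, each query attends almost exclusively to its nearest historical patch in 3D; with `gamma=0`, the mechanism reduces to plain scaled dot-product attention.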