The emergence of visual foundation models has revolutionized visual odometry~(VO) and SLAM, enabling pose estimation and dense reconstruction within a single feed-forward network. However, unlike traditional pipelines that leverage keyframe selection to enhance efficiency and accuracy, current foundation-model-based methods, such as VGGT-Long, typically process raw image sequences indiscriminately. This leads to computational redundancy and degraded performance under low inter-frame parallax, which provides limited stereo context. Integrating traditional geometric heuristics into these methods is non-trivial, as their performance depends on high-dimensional latent representations rather than explicit geometric metrics. To bridge this gap, we propose a novel keyframe-based feed-forward VO framework. Instead of relying on hand-crafted rules, our approach employs reinforcement learning to derive an adaptive keyframe policy in a data-driven manner, aligning selection with the intrinsic characteristics of the underlying foundation model. We train our agent on the TartanAir dataset and conduct extensive evaluations across several real-world datasets. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.
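To make the idea of a learned keyframe policy concrete, the following is a minimal toy sketch of policy-gradient (REINFORCE) keyframe selection. Everything here is an illustrative assumption, not the paper's actual design: the per-frame feature (parallax plus a bias term), the logistic keep/skip policy, and the reward (parallax gain minus a fixed selection cost standing in for computational redundancy) are all hypothetical simplifications of what the agent would learn from the foundation model's latent representations.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_prob(w, feat):
    # Probability of flagging this frame as a keyframe (logistic policy).
    return 1.0 / (1.0 + np.exp(-feat @ w))

def step_grad(w, feat, reward_fn):
    # Sample a keep/skip action, observe reward, and return the
    # REINFORCE gradient r * d log pi(a|feat) / dw (no baseline, for brevity).
    p = policy_prob(w, feat)
    a = 1 if rng.random() < p else 0
    r = reward_fn(a, feat)
    return r * (a - p) * feat

def reward_fn(action, feat):
    # Toy reward: selecting a frame pays off in proportion to its
    # inter-frame parallax (feat[0]), minus a fixed cost per selected
    # frame that stands in for the extra compute it incurs.
    return (feat[0] - 0.1) if action == 1 else 0.0

# Synthetic sequence: per-frame feature = [inter-frame parallax, bias term].
T = 50
parallax = rng.uniform(0.0, 1.0, size=T)
feats = np.stack([parallax, np.ones(T)], axis=1)

w = np.zeros(2)
lr = 0.5
for episode in range(200):
    grads = np.array([step_grad(w, f, reward_fn) for f in feats])
    w += lr * grads.mean(axis=0)

# After training, the policy should prefer high-parallax frames.
probs = np.array([policy_prob(w, f) for f in feats])
print(probs[parallax > 0.8].mean(), probs[parallax < 0.2].mean())
```

In the real method the feature vector would come from the foundation model's latent state rather than a scalar parallax, which is precisely why hand-crafted geometric thresholds are hard to transplant and a data-driven policy is used instead.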