Monocular visual odometry (MVO) is foundational to autonomous navigation and robotic localization. However, existing learning-based MVO approaches often either lack interpretable, complementary features or rely on overly complex multi-stage architectures, which limits their robustness and cross-domain generalization. In this work, we propose MVOFormer, a novel transformer framework for robust monocular visual odometry. Our architecture features a Flow-Semantic Dual Branch Encoder that combines dense geometric motion cues with object-centric semantic priors, explicitly distinguishing static structures from dynamic distractors. These representations are then fused by an Iterative Multimodal Decoder, which refines the pose from coarse to fine while dynamically suppressing attention on unreliable regions. Extensive evaluations demonstrate that, without any target-domain fine-tuning, MVOFormer achieves superior zero-shot generalization and robustness, significantly outperforming prior learning-based frame-to-frame methods across diverse benchmarks including TartanAir, KITTI, TUM-RGBD, and ETH3D-SLAM.
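To make the idea of iterative refinement with suppressed attention on unreliable regions more concrete, the following toy sketch may help. It is not MVOFormer's decoder: it assumes each image token casts a 2D translation vote (a stand-in for the flow branch) and carries a static-scene probability (a stand-in for the semantic branch), and it estimates only a 2D translation rather than a full 6-DoF pose. The function names `softmax` and `refine_translation` and the parameters `iters`, `tau`, and `eps` are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refine_translation(flows, static_prob, iters=4, tau=1.0, eps=1e-6):
    """Toy coarse-to-fine translation estimate (illustrative assumption only).

    flows:       list of (dx, dy) motion votes, one per token
    static_prob: list of probabilities that each token is static
    Returns the refined translation and the final attention weights.
    """
    n = len(flows)
    # Coarse initial estimate: plain mean of all votes, dynamic ones included.
    t = [sum(f[0] for f in flows) / n, sum(f[1] for f in flows) / n]
    weights = [1.0 / n] * n
    for _ in range(iters):
        # Residual of each vote with respect to the current estimate.
        residuals = [(f[0] - t[0], f[1] - t[1]) for f in flows]
        # Attention logit: agreement with the current estimate plus a
        # log-prior that down-weights tokens unlikely to be static.
        logits = [
            -(rx * rx + ry * ry) / tau + math.log(p + eps)
            for (rx, ry), p in zip(residuals, static_prob)
        ]
        weights = softmax(logits)
        # Residual update: move the estimate toward the attended votes.
        t[0] += sum(w * r[0] for w, r in zip(weights, residuals))
        t[1] += sum(w * r[1] for w, r in zip(weights, residuals))
    return t, weights


if __name__ == "__main__":
    # Eight static tokens agree on (1, 0); two dynamic tokens vote (5, 5).
    flows = [(1.0, 0.0)] * 8 + [(5.0, 5.0)] * 2
    static_prob = [0.9] * 8 + [0.1] * 2
    t, w = refine_translation(flows, static_prob)
    print("refined translation:", t)
    print("weight on last (dynamic) token:", w[-1])
```

In this sketch the first estimate is biased by the dynamic tokens, and each pass reweights the votes so that tokens that disagree with the current estimate, or that the semantic prior marks as likely dynamic, contribute less. A learned decoder would replace the hand-written logits with attention computed from features.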