We propose XVO, a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and settings. In contrast to standard monocular VO approaches, which often assume a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale from visual scene semantics, i.e., without relying on any known camera parameters. We optimize the motion estimation model via self-training on large amounts of unconstrained and heterogeneous dash camera videos available on YouTube. Our key contribution is twofold. First, we empirically demonstrate the benefits of semi-supervised training for learning a general-purpose direct VO regression network. Second, we demonstrate that multi-modal supervision, including segmentation, flow, depth, and audio auxiliary prediction tasks, facilitates generalized representations for the VO task. Specifically, we find that the audio prediction task significantly enhances the semi-supervised learning process while mitigating the effect of noisy pseudo-labels, particularly in highly dynamic and out-of-domain video data. Our proposed teacher network achieves state-of-the-art performance on the commonly used KITTI benchmark despite using no multi-frame optimization or knowledge of camera parameters. Combined with the proposed semi-supervised step, XVO demonstrates off-the-shelf knowledge transfer across diverse conditions on KITTI, nuScenes, and Argoverse without fine-tuning.