Monocular visual odometry is a key technology in a wide variety of autonomous systems. Relative to traditional feature-based methods, which suffer from failures due to poor lighting, insufficient texture, large motions, and similar conditions, recent learning-based SLAM methods exploit iterative dense bundle adjustment to address such failure cases and achieve robust, accurate localization in a wide variety of real environments, without depending on domain-specific training data. However, despite its potential, learning-based SLAM still struggles with scenarios involving large motion and object dynamics. In this paper, we diagnose key weaknesses in a popular learning-based SLAM model (DROID-SLAM) by analyzing its major failure cases on outdoor benchmarks and exposing shortcomings of its optimization process. We then propose the use of self-supervised priors from a frozen, large-scale pre-trained monocular depth estimation model to initialize the dense bundle adjustment process, leading to robust visual odometry without the need to fine-tune the SLAM backbone. Despite its simplicity, our proposed method demonstrates significant improvements on the KITTI odometry benchmark, as well as the challenging DDAD benchmark. Code and pre-trained models will be released upon publication.
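The core idea — seeding dense bundle adjustment with depth from a frozen pre-trained network instead of a constant initialization — can be sketched minimally as follows. This is an illustrative sketch, not the paper's implementation: the function names and the stub depth network are hypothetical (the abstract does not specify the model used), and a placeholder stands in for the actual pre-trained estimator.

```python
import numpy as np

def init_ba_inverse_depth(depth_pred, eps=1e-6):
    """Convert a per-pixel monocular depth prediction (H, W) into the
    inverse-depth parameterization that dense bundle adjustment typically
    optimizes, to be used as initialization in place of a constant map."""
    return 1.0 / np.clip(depth_pred, eps, None)

def frozen_depth_model(image):
    """Hypothetical stand-in for a frozen, large-scale pre-trained
    monocular depth estimator; here it just returns a constant 10 m."""
    h, w = image.shape[:2]
    return np.full((h, w), 10.0)

# Usage: predict depth for a keyframe, then hand the inverse-depth map
# to the dense bundle adjustment solver as its starting point.
image = np.zeros((4, 6, 3))
inv_depth_init = init_ba_inverse_depth(frozen_depth_model(image))
```

Because the depth network stays frozen, this initialization can be added in front of an existing SLAM backbone without retraining it, which is the property the abstract emphasizes.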