Visual Odometry (VO) plays a pivotal role in autonomous systems, and one of its principal challenges is the lack of depth information in camera images. This paper introduces OCC-VO, a novel framework that capitalizes on recent advances in deep learning to transform 2D camera images into 3D semantic occupancy, thereby circumventing the traditional need for concurrent estimation of ego poses and landmark locations. Within this framework, we use TPV-Former to convert surround-view camera images into 3D semantic occupancy. To address the challenges presented by this transformation, we tailor a pose estimation and mapping algorithm that incorporates a Semantic Label Filter and a Dynamic Object Filter, and uses a Voxel PFilter to maintain a consistent global semantic map. Evaluations on Occ3D-nuScenes show a 20.6% improvement in Success Ratio and a 29.6% improvement in trajectory accuracy over ORB-SLAM3, and also demonstrate the ability to construct a comprehensive map. Our implementation is open-sourced and available at: https://github.com/USTCLH/OCC-VO.
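As a rough illustration of the idea behind filtering semantic occupancy before pose estimation, the sketch below converts a sparse labelled voxel grid into metric points and drops voxels whose labels are in a configurable set. It is not the OCC-VO implementation: the label ids, the function name `occupancy_to_points`, the voxel size, the grid origin, and the choice of which labels to drop are all assumptions made for this example.

```python
from typing import Dict, FrozenSet, List, Tuple

Voxel = Tuple[int, int, int]             # integer grid index (i, j, k)
Point = Tuple[float, float, float, int]  # metric x, y, z plus semantic label

# Hypothetical label ids, chosen only for this sketch.
FREE, CAR, ROAD, BUILDING = 0, 1, 2, 3


def occupancy_to_points(
    grid: Dict[Voxel, int],
    voxel_size: float,
    origin: Tuple[float, float, float],
    drop_labels: FrozenSet[int] = frozenset({FREE, CAR}),
) -> List[Point]:
    """Return voxel centres (in metres) for voxels whose label is kept.

    Assumption: labels in `drop_labels` (free space, potentially moving
    classes) are unhelpful for registration and are simply discarded.
    """
    points: List[Point] = []
    for (i, j, k), label in sorted(grid.items()):
        if label in drop_labels:
            continue
        # Centre of the voxel: origin + (index + 0.5) * voxel edge length.
        x = origin[0] + (i + 0.5) * voxel_size
        y = origin[1] + (j + 0.5) * voxel_size
        z = origin[2] + (k + 0.5) * voxel_size
        points.append((x, y, z, label))
    return points


example_grid = {
    (0, 0, 0): ROAD,
    (1, 0, 0): CAR,
    (0, 1, 0): FREE,
    (2, 2, 1): BUILDING,
}
static_points = occupancy_to_points(example_grid, 0.5, (-1.0, -1.0, 0.0))
```

The resulting labelled points could then be passed to a point-cloud registration step to estimate the relative pose between consecutive frames; that step is omitted here.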