Estimating the camera pose from the images of a single camera is a traditional task in mobile robotics and autonomous vehicles. This problem, known as monocular visual odometry, is often solved with geometric approaches that require engineering effort for each specific scenario. Deep learning methods have been shown to generalize well given proper training and a considerable amount of available data. Transformer-based architectures have dominated the state of the art in natural language processing and in computer vision tasks such as image and video understanding. In this work, we treat monocular visual odometry as a video understanding task to estimate the 6-DoF camera pose. We contribute the TSformer-VO model, which is based on spatio-temporal self-attention mechanisms, to extract features from clips and estimate the motions in an end-to-end manner. Our approach achieved competitive performance compared with state-of-the-art geometry-based and deep learning-based methods on the KITTI visual odometry dataset, and it outperformed the DeepVO implementation widely accepted in the visual odometry community.
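To make the idea of spatio-temporal self-attention over a clip concrete, the following is a minimal NumPy sketch and not the TSformer-VO implementation. It assumes a divided space-time scheme (temporal attention across frames at each patch location, followed by spatial attention across patches within each frame), a single attention head, random untrained weights, and one 6-DoF motion vector per pair of consecutive frames. All function names, the patch size, and the dimensions are hypothetical choices for illustration; positional embeddings, layer normalization, and MLP blocks are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (..., n, d); attention runs over the n axis.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def patchify(clip, patch):
    """Split a clip (T, H, W, C) into flattened patches (T, N, patch*patch*C)."""
    t, h, w, c = clip.shape
    gh, gw = h // patch, w // patch
    x = clip[:, : gh * patch, : gw * patch]
    x = x.reshape(t, gh, patch, gw, patch, c).transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(t, gh * gw, patch * patch * c)


def divided_space_time_block(x, params):
    """Temporal then spatial attention with residual connections; x: (T, N, D)."""
    # Temporal: each patch location attends across the T frames.
    xt = np.swapaxes(x, 0, 1)                       # (N, T, D)
    xt = xt + self_attention(xt, *params["time"])
    x = np.swapaxes(xt, 0, 1)                       # (T, N, D)
    # Spatial: each frame attends across its N patches.
    return x + self_attention(x, *params["space"])


def init_params(rng, patch_dim, d_model, n_frames):
    """Random, untrained weights (hypothetical layout, illustration only)."""
    def mat(rows, cols):
        return rng.normal(0.0, 0.02, size=(rows, cols))

    return {
        "embed": mat(patch_dim, d_model),
        "time": [mat(d_model, d_model) for _ in range(3)],
        "space": [mat(d_model, d_model) for _ in range(3)],
        "head": mat(d_model, 6 * (n_frames - 1)),
    }


def estimate_motions(clip, params, patch=16):
    """Map a clip of T frames to (T - 1, 6) motion vectors."""
    tokens = patchify(clip, patch) @ params["embed"]  # (T, N, D)
    feats = divided_space_time_block(tokens, params)
    clip_feat = feats.mean(axis=(0, 1))               # average-pool all tokens
    return (clip_feat @ params["head"]).reshape(clip.shape[0] - 1, 6)


rng = np.random.default_rng(0)
clip = rng.random((3, 64, 64, 3))                     # 3 RGB frames, 64x64
params = init_params(rng, patch_dim=16 * 16 * 3, d_model=32, n_frames=3)
motions = estimate_motions(clip, params)
print(motions.shape)  # (2, 6)
```

In this sketch, a clip of three frames yields two 6-DoF vectors, one per consecutive frame pair. In an end-to-end setting, the weights would be trained by regressing these vectors against ground-truth relative poses.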