Visual odometry (VO) and SLAM have relied on multi-view geometry, via local structure from motion, for decades. These methods are at a disadvantage in challenging scenarios such as low-texture images and dynamic scenes. Meanwhile, deep neural networks are now ubiquitous in computer vision for extracting high-level features. For VO, such networks can use these features to estimate depth and pose. The VO task can then be modeled as an image generation task in which pose estimation is a by-product. This can be done in a self-supervised manner, which removes the need for the large amounts of labeled data that supervised training of deep neural networks requires. Although some works have tried a similar approach [1], their depth and pose estimates are sometimes imprecise, which leads to accumulated error (drift) along the trajectory. The goal of this work is to address these limitations and to develop a method that provides better depth and pose estimates. To this end, two approaches are explored: 1) Modeling: optical flow and recurrent neural networks (RNNs) are used to exploit spatio-temporal correlations, which provide more information for estimating depth. 2) Loss function: a generative adversarial network (GAN) [2] is deployed to improve depth estimation (and thereby pose estimation), as shown in Figure 1. This additional loss term improves the realism of the generated images and reduces artifacts.
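The loss described above, an image reconstruction objective with an additional adversarial term, can be illustrated with a minimal sketch. The text does not give the exact form of the loss, so everything here is an assumption made for illustration: the L1 photometric term, the non-saturating generator loss, the weight `adv_weight`, and all function names are hypothetical. Images are represented as flat sequences of pixel intensities to keep the example dependency-free.

```python
import math
from typing import Sequence


def photometric_l1(target: Sequence[float], synthesized: Sequence[float]) -> float:
    """Mean absolute difference between the target frame and the frame
    synthesized from predicted depth and pose (assumed reconstruction term)."""
    if len(target) != len(synthesized):
        raise ValueError("target and synthesized must have the same length")
    return sum(abs(t - s) for t, s in zip(target, synthesized)) / len(target)


def generator_adversarial_loss(disc_probs: Sequence[float], eps: float = 1e-7) -> float:
    """Non-saturating generator loss, -log D(G(x)), averaged over samples.

    disc_probs holds the discriminator's probability that each synthesized
    image is real; values are clipped to avoid log(0).
    """
    clipped = [min(max(p, eps), 1.0 - eps) for p in disc_probs]
    return -sum(math.log(p) for p in clipped) / len(clipped)


def total_loss(
    target: Sequence[float],
    synthesized: Sequence[float],
    disc_probs: Sequence[float],
    adv_weight: float = 0.01,  # hypothetical weighting, not taken from the text
) -> float:
    """Reconstruction loss plus a weighted adversarial term."""
    return photometric_l1(target, synthesized) + adv_weight * generator_adversarial_loss(disc_probs)


if __name__ == "__main__":
    target = [0.5, 0.5]
    synthesized = [0.6, 0.4]
    print(total_loss(target, synthesized, disc_probs=[0.5]))
```

In this sketch the adversarial term shrinks as the discriminator rates the synthesized image as more realistic, so minimizing the total pushes the generated views toward both photometric agreement and realism.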