UnRectDepthNet: Self-Supervised Monocular Depth Estimation using a Generic Framework for Handling Common Camera Distortion Models

In classical computer vision, rectification is an integral part of multi-view depth estimation. It typically includes epipolar rectification and lens distortion correction. This process simplifies depth estimation significantly, and thus it has been adopted in CNN approaches. However, rectification has several side effects, including a reduced field of view (FOV), resampling distortion, and sensitivity to calibration errors. These effects are particularly pronounced in the case of significant distortion (e.g., wide-angle fisheye cameras). In this paper, we propose a generic scale-aware self-supervised pipeline for estimating depth, Euclidean distance, and visual odometry from unrectified monocular videos. On the unrectified KITTI dataset with barrel distortion, we demonstrate a level of precision comparable to that on the rectified KITTI dataset. The intuition is that the rectification step can be implicitly absorbed within the CNN model, which learns the distortion model without increasing complexity. Our approach does not suffer from a reduced field of view and avoids the computational cost of rectification at inference time. To further illustrate the general applicability of the proposed framework, we apply it to wide-angle fisheye cameras with a 190$^\circ$ horizontal field of view. The training framework UnRectDepthNet takes the camera distortion model as an argument and adapts the projection and unprojection functions accordingly. The proposed algorithm is further evaluated on the rectified KITTI dataset, where we achieve state-of-the-art results that improve upon our previous work, FisheyeDistanceNet. Qualitative results on a distorted test scene video sequence indicate excellent performance: https://youtu.be/K6pbx3bU4Ss.
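The idea of passing the camera distortion model as an argument, with projection and unprojection adapting to it, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the paper's implementation: the function names `project` and `unproject`, the two textbook models (ideal pinhole and equidistant fisheye), and the single focal length `f` with principal point `(cx, cy)`. The distortion models actually used by UnRectDepthNet are not specified in this abstract.

```python
import math


def project(point, model, f, cx, cy):
    """Project a 3D camera-frame point (X, Y, Z) to pixel coordinates (u, v).

    `model` selects the (assumed) radial mapping from incidence angle to
    image radius; the rest of the function is shared by both models.
    """
    X, Y, Z = point
    chi = math.hypot(X, Y)        # distance of the point from the optical axis
    theta = math.atan2(chi, Z)    # angle between the ray and the optical axis
    if model == "pinhole":
        r = f * math.tan(theta)   # only meaningful for theta < 90 degrees
    elif model == "equidistant":
        r = f * theta             # fisheye: radius grows linearly with angle
    else:
        raise ValueError(f"unknown camera model: {model}")
    if chi == 0.0:
        return (cx, cy)           # point on the optical axis
    return (cx + r * X / chi, cy + r * Y / chi)


def unproject(pixel, distance, model, f, cx, cy):
    """Lift pixel (u, v) to the 3D point at a given Euclidean distance.

    Euclidean distance (norm of the point) is used instead of depth Z so
    that rays beyond 90 degrees from the optical axis stay representable.
    """
    u, v = pixel
    dx, dy = u - cx, v - cy
    r = math.hypot(dx, dy)        # radial distance from the principal point
    if model == "pinhole":
        theta = math.atan(r / f)
    elif model == "equidistant":
        theta = r / f
    else:
        raise ValueError(f"unknown camera model: {model}")
    if r == 0.0:
        return (0.0, 0.0, distance)
    s = distance * math.sin(theta)
    return (s * dx / r, s * dy / r, distance * math.cos(theta))
```

Under these assumptions, a point slightly behind the image plane, such as `(1.0, 0.0, -0.05)`, still round-trips through the equidistant model, whereas the pinhole mapping cannot represent it. This is one way to see why a wide field of view is lost when fisheye images are rectified to a pinhole geometry.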
