Single-Stage 3D Geometry-Preserving Depth Estimation Model Training on Dataset Mixtures with Uncalibrated Stereo Data

Robotics, AR, and 3D modeling applications have drawn considerable attention to single-view depth estimation (SVDE), which estimates scene geometry from a single RGB image. Recent works have demonstrated that the accuracy of an SVDE method depends heavily on the diversity and volume of the training data. However, RGB-D datasets obtained via depth capturing or 3D reconstruction are typically small, synthetic datasets are not photorealistic enough, and all of these datasets lack diversity. Large-scale, diverse data can be sourced from stereo images or stereo videos from the web. Because such stereo data is typically uncalibrated, it provides disparities only up to an unknown shift (geometrically incomplete data), so stereo-trained SVDE methods cannot recover 3D geometry. It was recently shown that the distorted point clouds obtained with a stereo-trained SVDE method can be corrected with additional point cloud modules (PCM) trained separately on geometrically complete data. In contrast, we propose GP$^{2}$, a General-Purpose and Geometry-Preserving training scheme, and show that conventional SVDE models can learn correct shifts themselves without any post-processing, benefiting from stereo data even in the geometry-preserving setting. Through experiments on different dataset mixtures, we demonstrate that GP$^{2}$-trained models outperform methods relying on PCM in both accuracy and speed, and we report state-of-the-art results in general-purpose geometry-preserving SVDE. Moreover, we show that SVDE models can learn to predict geometrically correct depth even when geometrically complete data makes up only a minor part of the training set.
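To illustrate why an unknown disparity shift prevents recovery of 3D geometry while an unknown scale does not, the following is a minimal NumPy sketch. It is not the GP$^{2}$ training scheme; it assumes depth is simply the reciprocal of disparity, and the names `disparity_to_depth` and `depth_ratios` as well as all numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np


def disparity_to_depth(disparity):
    """Convert disparity to depth, assuming depth = 1 / disparity."""
    return 1.0 / disparity


def depth_ratios(depth):
    """Depths relative to the first point; unchanged by a global depth scale."""
    return depth / depth[0]


# Hypothetical ground-truth depths of three scene points.
true_depth = np.array([1.0, 2.0, 4.0])
true_disparity = 1.0 / true_depth

# Disparity known only up to scale: depths are rescaled uniformly,
# so relative geometry is preserved.
scaled_depth = disparity_to_depth(2.0 * true_disparity)

# Disparity known only up to shift: depths change non-uniformly,
# so relative geometry is distorted.
shifted_depth = disparity_to_depth(true_disparity + 0.25)

print("true ratios:   ", depth_ratios(true_depth))
print("scaled ratios: ", depth_ratios(scaled_depth))
print("shifted ratios:", depth_ratios(shifted_depth))
```

In this toy setting the scaled disparities reproduce the true depth ratios, whereas the shifted disparities do not, which is the sense in which shift-ambiguous stereo data is geometrically incomplete.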
