High-Resolution Synthetic RGB-D Datasets for Monocular Depth Estimation

Accurate depth maps are essential in applications such as autonomous driving, scene reconstruction, and point-cloud creation. However, monocular depth estimation (MDE) algorithms often fail to preserve texture and sharpness, and they produce inconsistent results for homogeneous scenes. These algorithms mostly use CNN- or vision-transformer-based architectures, which require large datasets for supervised training. MDE algorithms trained on the available depth datasets do not generalize well and hence fail to perform accurately in diverse real-world scenes. Moreover, the ground-truth depth maps are either low-resolution or sparse, leading to relatively inconsistent depth maps. In general, acquiring a high-resolution ground-truth dataset with pixel-level precision for accurate depth prediction is expensive and time-consuming. In this paper, we generate a high-resolution synthetic depth dataset (HRSD) at a resolution of 1920 × 1080 from Grand Theft Auto V (GTA-V), containing 100,000 color images and corresponding dense ground-truth depth maps. The generated dataset is diverse, with scenes ranging from indoors to outdoors and from homogeneous surfaces to textured ones. For experiments and analysis, we train DPT, a state-of-the-art transformer-based MDE algorithm, on the proposed synthetic dataset, which increases the accuracy of depth maps on different scenes by 9%. Since the synthetic dataset is of higher resolution, we propose adding a feature extraction module to the transformer encoder and incorporating an attention-based loss, further improving the accuracy by 15%.
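The abstract does not state which metric underlies the reported accuracy gains. As an illustration only, the sketch below computes three metrics that are commonly reported for MDE (absolute relative error, RMSE, and the δ < 1.25 threshold accuracy) over pixels with valid ground truth. The function name `depth_metrics`, the `min_depth` parameter, and the convention that a ground-truth value of 0 marks a missing pixel are assumptions for this example, not details taken from the paper.

```python
import math


def depth_metrics(pred, gt, min_depth=1e-3):
    """Return (abs_rel, rmse, delta1) for two depth maps given as nested lists.

    Assumption (not from the paper): ground-truth pixels at or below
    min_depth are treated as missing and skipped, as is usual for sparse maps.
    """
    pairs = [
        (max(p, min_depth), g)  # clamp predictions to avoid division by zero
        for pred_row, gt_row in zip(pred, gt)
        for p, g in zip(pred_row, gt_row)
        if g > min_depth
    ]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)
    # Mean absolute error relative to the true depth.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Root-mean-square error in depth units.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Fraction of pixels whose prediction is within a factor of 1.25 of the truth.
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return abs_rel, rmse, delta1
```

For full 1920 × 1080 maps, an array library would be the practical choice; plain lists are used here only to keep the sketch free of dependencies.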

