Accurate depth maps are essential in applications such as autonomous driving, scene reconstruction, and point-cloud creation. However, monocular depth estimation (MDE) algorithms often fail to provide enough texture and sharpness, and they are inconsistent for homogeneous scenes. These algorithms mostly use CNN or vision-transformer-based architectures that require large datasets for supervised training. MDE algorithms trained on the available depth datasets do not generalize well, and hence fail to perform accurately in diverse real-world scenes. Moreover, the ground-truth depth maps in these datasets are either low resolution or sparse, which leads to relatively inconsistent predicted depth maps. In general, acquiring a high-resolution ground-truth dataset with pixel-level precision for accurate depth prediction is an expensive and time-consuming challenge. In this paper, we generate a high-resolution synthetic depth dataset (HRSD) of dimension 1920 × 1080 from Grand Theft Auto V (GTA-V), which contains 100,000 color images and corresponding dense ground-truth depth maps. The generated dataset is diverse, with scenes ranging from indoors to outdoors and from homogeneous surfaces to textured ones. For experiments and analysis, we train DPT, a state-of-the-art transformer-based MDE algorithm, on the proposed synthetic dataset, which significantly increases the accuracy of depth maps on different scenes by 9%. Because the synthetic dataset is of higher resolution, we propose adding a feature extraction module to the transformer encoder and incorporating an attention-based loss, further improving the accuracy by 15%.