Depth estimation from 2D images is a common computer vision task with applications in many fields, including autonomous vehicles, scene understanding, and robotics. The accuracy of a supervised depth estimation method depends mainly on the chosen loss function, the model architecture, the quality of the data, and the performance metrics. In this study, we propose a simplified and adaptable approach to improving depth estimation accuracy using transfer learning and an optimized loss function. The optimized loss function is a weighted combination of three losses that enhances robustness and generalization: Mean Absolute Error (MAE), Edge Loss, and Structural Similarity Index (SSIM). We use grid search and random search to find optimized weights for the losses, which leads to an improved model. We explore multiple encoder-decoder models, including DenseNet121, DenseNet169, DenseNet201, and EfficientNet, for supervised depth estimation on NYU Depth Dataset v2. We observe that EfficientNet, pre-trained on ImageNet for classification and used as an encoder with a simple upsampling decoder, gives the best results in terms of RMSE, REL, and log10: 0.386, 0.113, and 0.049, respectively. We also perform a qualitative analysis showing that our model produces depth maps that closely resemble the ground truth, even in cases where the ground truth is flawed. The results indicate significant improvements in accuracy and robustness, with EfficientNet being the most successful architecture.
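The weighted loss described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the weight values, the use of a single-window (global) SSIM rather than a sliding-window SSIM, and the gradient-based edge term are all assumptions made for clarity.

```python
import numpy as np

def mae_loss(pred, gt):
    """Mean Absolute Error between predicted and ground-truth depth maps."""
    return np.mean(np.abs(pred - gt))

def edge_loss(pred, gt):
    """Edge loss as the L1 difference of image gradients (one common formulation)."""
    dy_p, dx_p = np.gradient(pred)
    dy_g, dx_g = np.gradient(gt)
    return np.mean(np.abs(dy_p - dy_g) + np.abs(dx_p - dx_g))

def ssim_global(pred, gt, c1=0.01**2, c2=0.03**2):
    """Simplified SSIM computed over the whole map (no sliding window)."""
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = ((pred - mu_p) * (gt - mu_g)).mean()
    return ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / (
        (mu_p**2 + mu_g**2 + c1) * (var_p + var_g + c2))

def combined_loss(pred, gt, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of MAE, edge loss, and SSIM dissimilarity (1 - SSIM).

    The weights here are placeholders; in the paper they are tuned via
    grid search and random search.
    """
    w_mae, w_edge, w_ssim = weights
    return (w_mae * mae_loss(pred, gt)
            + w_edge * edge_loss(pred, gt)
            + w_ssim * (1.0 - ssim_global(pred, gt)))
```

Since SSIM equals 1 for identical inputs and the MAE and edge terms vanish, the combined loss is zero when the prediction matches the ground truth exactly, and grows as the prediction diverges in values or in edge structure.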