Estimating depth from a single 2D image is challenging because the stereo or multi-view cues typically required for depth perception are absent. This paper introduces a novel deep-learning approach built on an enhanced encoder-decoder architecture, with the Inception-ResNet-v2 model serving as the encoder. This is the first use of Inception-ResNet-v2 as an encoder for monocular depth estimation, and it demonstrates improved performance over previous models. Our model effectively captures complex objects and fine-grained details, which are generally difficult to predict, and it incorporates multi-scale feature extraction to improve depth prediction accuracy across varying object sizes and distances. We propose a composite loss function comprising a depth loss, a gradient edge loss, and a Structural Similarity Index Measure (SSIM) loss, with the weights of the weighted sum fine-tuned to balance the different aspects of depth estimation. Experimental results on the NYU Depth V2 dataset show that our model achieves state-of-the-art performance, with an Absolute Relative Error (ARE) of 0.064, a Root Mean Square Error (RMSE) of 0.228, and a threshold accuracy ($\delta < 1.25$) of 89.3%. These metrics demonstrate that the model predicts depth accurately even in challenging scenarios, providing a scalable solution for real-world applications in robotics, 3D reconstruction, and augmented reality.
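For concreteness, the composite loss described above can be sketched as a weighted sum; the weight symbols $\lambda_1, \lambda_2, \lambda_3$ and the per-term formulations below are illustrative assumptions, not taken from the abstract:

$$
\mathcal{L} = \lambda_1 \, \mathcal{L}_{\text{depth}} + \lambda_2 \, \mathcal{L}_{\text{grad}} + \lambda_3 \, \mathcal{L}_{\text{SSIM}},
$$

where $\mathcal{L}_{\text{depth}}$ penalizes point-wise depth error, $\mathcal{L}_{\text{grad}}$ penalizes differences between predicted and ground-truth depth gradients to sharpen edges, and the SSIM term is commonly taken as $\mathcal{L}_{\text{SSIM}} = \frac{1 - \operatorname{SSIM}(\hat{y}, y)}{2}$ for prediction $\hat{y}$ and ground truth $y$; the $\lambda_i$ are the fine-tuned balancing weights.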