Estimating depth from two-dimensional images has long been a challenging and extensively studied problem in computer vision. Recently, deep learning-based approaches have brought significant progress and proven highly successful. This paper focuses on the explainability of monocular depth estimation methods in terms of how humans perceive depth. This preliminary study concentrates on one of the most significant visual cues, relative size, which is prominent in almost all viewed images. We designed an experiment that mimics experiments conducted with humans and tested state-of-the-art methods on it to indirectly assess their explainability in the defined context. In addition, we observed that measuring accuracy required further attention, and we propose a particular approach to this end. The results show a mean accuracy of around 77% across methods, with some methods performing markedly better, thus indirectly revealing their potential to uncover monocular depth cues such as relative size.