Depth estimation from a single image is a challenging problem in computer vision because binocular disparity and motion information are absent. Although end-to-end trained deep neural architectures have recently achieved impressive performance in this area, it is hard to know which image cues these black-box systems exploit. In this work, we quantify the relative contributions of known depth cues in a monocular depth estimation setting using an indoor scene data set. We use feature extraction techniques to isolate the individual features of shape, texture, colour and saturation, and use each one on its own to predict depth. We find that object shape, extracted by edge detection, contributes substantially more than the other features in the indoor setting considered, while the other features also contribute to varying degrees. These insights can help optimise depth estimation models, improving their accuracy and robustness, and may broaden the practical applications of vision-based depth estimation. The project code is included in the supplementary material and will be published on GitHub.
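To make the idea of isolating single cues concrete, the following is a minimal sketch of how shape, texture, colour and saturation maps could be computed separately from one RGB image before each is passed to a depth predictor. The abstract does not state which operators are used, so every choice here is an assumption made for illustration: Sobel gradient magnitude stands in for edge detection, local 3x3 intensity variance for texture, brightness-normalised chromaticity for colour, and HSV-style saturation for saturation. All function names are hypothetical and are not taken from the project code.

```python
import numpy as np


def to_gray(rgb):
    # Luma from an RGB image of shape (H, W, 3) with float values in [0, 1].
    return rgb @ np.array([0.299, 0.587, 0.114])


def filter3x3(img, kernel):
    # 3x3 cross-correlation with edge padding; output has the same shape as img.
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def shape_cue(rgb):
    # Assumed stand-in for "shape via edge detection": Sobel gradient magnitude.
    gray = to_gray(rgb)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    gx = filter3x3(gray, kx)
    gy = filter3x3(gray, kx.T)
    return np.hypot(gx, gy)


def texture_cue(rgb):
    # Assumed texture measure: variance of intensity in each 3x3 neighbourhood.
    gray = to_gray(rgb)
    box = np.full((3, 3), 1.0 / 9.0)
    mean = filter3x3(gray, box)
    mean_sq = filter3x3(gray ** 2, box)
    return np.clip(mean_sq - mean ** 2, 0.0, None)


def colour_cue(rgb):
    # Chromaticity: each channel divided by the channel sum, removing brightness.
    total = rgb.sum(axis=-1, keepdims=True)
    return rgb / np.where(total > 0, total, 1.0)


def saturation_cue(rgb):
    # HSV-style saturation: (max - min) / max, set to 0 where max is 0.
    cmax = rgb.max(axis=-1)
    cmin = rgb.min(axis=-1)
    safe_max = np.where(cmax > 0, cmax, 1.0)
    return np.where(cmax > 0, (cmax - cmin) / safe_max, 0.0)


def extract_cues(rgb):
    # Each entry is one isolated cue that could be fed to a depth model on its own.
    return {
        "shape": shape_cue(rgb),
        "texture": texture_cue(rgb),
        "colour": colour_cue(rgb),
        "saturation": saturation_cue(rgb),
    }


if __name__ == "__main__":
    # Random image used only to show the call pattern and output shapes.
    image = np.random.default_rng(0).random((32, 32, 3))
    for name, cue in extract_cues(image).items():
        print(name, cue.shape)
```

Under this setup, the relative contribution of a cue could be estimated by training the same depth predictor on each map in turn and comparing the resulting errors; the predictor and the error metric are left open here because the abstract does not specify them.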