Predicting absolute depth maps from single images of real, unseen indoor scenes has long been an ill-posed problem. We observe that the difficulty stems not only from scale ambiguity but also from focal ambiguity, both of which reduce the generalization ability of monocular depth estimation: images may be captured by cameras with different focal lengths in scenes of different scales. In this paper, we develop a focal-and-scale depth estimation model that learns absolute depth maps from single images of unseen indoor scenes. First, a relative depth estimation network learns relative depths from single images with diverse scales and semantics. Second, multi-scale features are generated by mapping a single focal length value to focal length features and concatenating them with intermediate features of different scales from the relative depth estimation network. Finally, the relative depths and multi-scale features are jointly fed into an absolute depth estimation network. In addition, we develop a new pipeline to augment the focal length diversity of public datasets, which are often captured with cameras of the same or similar focal lengths. Our model is trained on augmented NYUDv2 and tested on three unseen datasets. Compared with five recent state-of-the-art (SOTA) methods, it improves the generalization ability of depth estimation by 41%/13% (RMSE) with/without data augmentation and alleviates the deformation problem in 3D reconstruction. Notably, our model maintains depth estimation accuracy on the original NYUDv2.
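The scale and focal ambiguities named in the abstract can be illustrated with an idealized pinhole camera. The following is a minimal sketch for intuition only, not part of the proposed model; the function name `projected_height_px` and all numeric values are hypothetical.

```python
def projected_height_px(object_height_m, depth_m, focal_px):
    """Idealized pinhole projection: image height (pixels) of a
    fronto-parallel object of the given height at the given depth."""
    return focal_px * object_height_m / depth_m


# Scale ambiguity: same camera, but the whole scene is scaled by 2
# (object twice as tall, twice as far away).
h_small_scene = projected_height_px(1.0, 2.0, 500.0)
h_large_scene = projected_height_px(2.0, 4.0, 500.0)

# Focal ambiguity: same object, but a camera with twice the focal
# length placed at twice the distance.
h_short_focal = projected_height_px(1.0, 2.0, 500.0)
h_long_focal = projected_height_px(1.0, 4.0, 1000.0)

print(h_small_scene, h_large_scene)  # 250.0 250.0
print(h_short_focal, h_long_focal)   # 250.0 250.0
```

In both cases the image evidence is identical while the true depths differ (2 m versus 4 m), which is why a single image alone does not determine absolute depth and why supplying the focal length as an input can help.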