This work presents a generalizable framework for transferring relative depth to metric depth. Current monocular depth estimation methods are mainly divided into monocular metric depth estimation (MMDE) and monocular relative depth estimation (MRDE). MMDEs estimate depth at metric scale but are often limited to a specific domain. MRDEs generalize well across domains, but their scale is uncertain, which hinders downstream applications. To this end, we aim to build a framework that resolves scale uncertainty and transfers relative depth to metric depth. Previous methods used language as input and estimated two global factors for rescaling. Our approach, TR2M, utilizes both the text description and the image as inputs and estimates two rescale maps to transfer relative depth to metric depth at the pixel level. Features from the two modalities are fused with a cross-modality attention module to better capture scale information. A strategy is designed to construct and filter confident pseudo metric depth for more comprehensive supervision. We also develop scale-oriented contrastive learning, which uses the depth distribution as guidance to encourage the model to learn intrinsic knowledge aligned with the scale distribution. TR2M exploits only a small number of trainable parameters for training on datasets across various domains, and experiments not only demonstrate TR2M's strong performance on seen datasets but also reveal superior zero-shot capability on five unseen datasets. We show the great potential of pixel-wise transfer of relative depth to metric depth with language assistance. (Code is available at: https://github.com/BeileiCui/TR2M)
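To make the core idea concrete, below is a minimal PyTorch sketch of pixel-wise rescaling driven by cross-modality attention. The module name, feature dimensions, and the per-pixel affine form (metric = scale ⊙ relative + shift) are assumptions for illustration, not the paper's exact architecture; consult the repository above for the actual implementation.

```python
import torch
import torch.nn as nn

class CrossModalRescaler(nn.Module):
    """Hypothetical sketch: fuse image and text features with cross-attention,
    then predict per-pixel scale and shift maps to turn relative depth into
    metric depth. Assumes features are already at the depth-map resolution."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Image features (queries) attend to text tokens (keys/values)
        # to pick up scale cues from the language description.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_maps = nn.Conv2d(dim, 2, kernel_size=1)  # -> scale & shift maps

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor,
                relative_depth: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W); txt_feat: (B, T, C); relative_depth: (B, 1, H, W)
        b, c, h, w = img_feat.shape
        q = img_feat.flatten(2).transpose(1, 2)            # (B, H*W, C)
        fused, _ = self.cross_attn(q, txt_feat, txt_feat)  # cross-modality fusion
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        scale_map, shift_map = self.to_maps(fused).chunk(2, dim=1)
        # Pixel-wise affine rescaling: every location gets its own scale and
        # shift, unlike prior work that predicts two global factors per image.
        return scale_map * relative_depth + shift_map

# Usage with dummy inputs:
m = CrossModalRescaler()
metric = m(torch.randn(2, 256, 24, 32),   # image features
           torch.randn(2, 16, 256),       # text token features
           torch.rand(2, 1, 24, 32))      # relative depth
```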