Pre-trained Vision-Language Models (VLMs), such as CLIP, have shown strong performance across a range of tasks that integrate visual and linguistic modalities. When CLIP is applied to depth estimation, patches of the input image are compared with a set of semantic descriptions of depth to obtain similarity scores. A coarse depth estimate is then obtained as a similarity-weighted sum of the depth values, called depth bins, that correspond to the predefined semantic descriptions. This zero-shot approach avoids the computational and time costs of traditional fully supervised depth estimation methods. However, because it relies on fixed depth bins, it may not generalize well, as images from different scenes can exhibit distinct depth distributions. To address this challenge, we propose a few-shot method that learns to adapt VLMs for monocular depth estimation, balancing training cost and generalization capability. Specifically, it assigns different depth bins to different scenes, which the model selects among during inference. Additionally, we incorporate learnable prompts that preprocess the input text, converting human-readable text into vectors that the model can interpret more easily, which further improves performance. With only one image per scene for training, extensive experiments on the NYU V2 and KITTI datasets demonstrate that our method outperforms the previous state-of-the-art method by up to 10.6% in terms of MARE.
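The coarse estimation step described above, in which patch-text similarities are turned into weights over depth bins, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the softmax temperature, the example bin values, and in particular the nearest-prototype rule in `select_scene_bins` are assumptions introduced for illustration, since the abstract does not specify how the model selects scene-specific bins. Random vectors stand in for CLIP image and text embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def coarse_depth(patch_feats, text_feats, depth_bins, temperature=0.1):
    """Similarity-weighted sum of depth bins.

    patch_feats: (P, D) patch embeddings
    text_feats:  (K, D) embeddings of K depth descriptions
    depth_bins:  (K,)   depth value paired with each description
    temperature: assumed softmax temperature (hypothetical value)
    returns:     (P,)   one coarse depth value per patch
    """
    sim = l2_normalize(patch_feats) @ l2_normalize(text_feats).T  # (P, K)
    weights = softmax(sim / temperature, axis=-1)                 # rows sum to 1
    return weights @ depth_bins


def select_scene_bins(image_feat, scene_feats, scene_bins):
    """Pick the bin set of the most similar scene prototype.

    ASSUMPTION: nearest-prototype selection is a stand-in; the abstract
    only states that the model selects bins during inference.
    image_feat:  (D,)   global image embedding
    scene_feats: (S, D) one embedding per scene type
    scene_bins:  (S, K) one set of depth bins per scene type
    """
    scores = l2_normalize(scene_feats) @ l2_normalize(image_feat)  # (S,)
    return scene_bins[int(np.argmax(scores))]


# Toy usage with random stand-ins for CLIP features.
rng = np.random.default_rng(0)
P, K, D, S = 6, 4, 16, 2
patch_feats = rng.normal(size=(P, D))
text_feats = rng.normal(size=(K, D))
scene_feats = rng.normal(size=(S, D))
# Hypothetical bin values in metres: one row for a near-range scene type,
# one for a far-range scene type.
scene_bins = np.array([[1.0, 2.0, 4.0, 8.0],
                       [5.0, 15.0, 40.0, 80.0]])

bins = select_scene_bins(patch_feats.mean(axis=0), scene_feats, scene_bins)
depth = coarse_depth(patch_feats, text_feats, bins)
print(depth.shape)
```

Because the weights for each patch are non-negative and sum to one, every predicted depth lies between the smallest and largest bin. This is why a fixed set of bins constrains the output range regardless of the scene, and why choosing bins per scene can help when depth distributions differ.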