Recent studies on generalizing CLIP to monocular depth estimation reveal that CLIP, pre-trained on web-crawled data, is ineffective at deriving proper similarities between image patches and depth-related prompts. In this paper, we adapt CLIP to dense monocular depth estimation of meaningful quality without fine-tuning its original vision-language alignment. By jointly training a compact deconvolutional decoder with a tiny learnable embedding matrix named mirror, which serves as a static prompt for the text encoder, we enable CLIP to understand depth. With this approach, our model matches the performance of several previous state-of-the-art vision-only models on the NYU Depth v2 and KITTI datasets and outperforms every CLIP-based depth estimation model by a large margin. Experiments on temporal depth consistency and spatial continuity demonstrate that the prior knowledge of CLIP can be effectively refined by the proposed framework. Furthermore, an ablation study on mirror shows that the resulting model estimates depth using knowledge from both the image encoder and the text encoder, even though it is never given a human-written prompt. This research demonstrates that, with minimal adjustments, the prior knowledge of vision-language foundation models such as CLIP can be generalized even to domains that are difficult to learn during pre-training. We hope to facilitate future work on adjusting the suboptimal prior knowledge of vision-language models with non-human-language prompts, achieving performance on par with task-specific state-of-the-art methods.
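The training setup described above (frozen encoders, a learnable static prompt, and a small deconvolutional decoder) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation. The names `FrozenEncoder` and `MirrorDepthSketch`, all dimensions, and the choice to feed patch-to-prompt cosine similarities into the decoder are assumptions made for illustration; the abstract does not specify how the two encoders' outputs are fused. The linear `FrozenEncoder` is a hypothetical stand-in for CLIP's pre-trained encoders and, unlike a real Transformer text encoder, it encodes each row of mirror independently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenEncoder(nn.Module):
    """Hypothetical stand-in for a pre-trained CLIP encoder; weights stay frozen."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        for p in self.parameters():
            p.requires_grad = False  # pre-trained weights are never updated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class MirrorDepthSketch(nn.Module):
    """Illustrative only: the fusion of image and text features is an assumption."""

    def __init__(self, patch_dim=48, embed_dim=32, n_tokens=4, token_dim=16):
        super().__init__()
        self.image_encoder = FrozenEncoder(patch_dim, embed_dim)
        self.text_encoder = FrozenEncoder(token_dim, embed_dim)
        # "mirror": a small learnable matrix used as a static, input-independent
        # prompt for the text encoder (sizes here are assumed, not from the paper).
        self.mirror = nn.Parameter(torch.randn(n_tokens, token_dim) * 0.02)
        # Compact deconvolutional decoder: two stride-2 transposed convolutions,
        # giving 4x spatial upsampling of the patch grid.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(n_tokens, 8, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(8, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, grid_h, grid_w, patch_dim)
        img = F.normalize(self.image_encoder(patches), dim=-1)     # (B, H, W, D)
        txt = F.normalize(self.text_encoder(self.mirror), dim=-1)  # (T, D)
        sim = img @ txt.t()                                        # (B, H, W, T)
        sim = sim.permute(0, 3, 1, 2)                              # (B, T, H, W)
        return self.decoder(sim)                                   # (B, 1, 4H, 4W)


model = MirrorDepthSketch()
depth = model(torch.randn(2, 7, 7, 48))
print(depth.shape)  # torch.Size([2, 1, 28, 28])

# Only mirror and the decoder are passed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```

Because the encoder weights have `requires_grad = False`, gradients still flow through the text encoder to `mirror`, but the encoders themselves are left unchanged, which mirrors the abstract's claim that the original vision-language alignment is not fine-tuned.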