Leveraging the rich semantic features of vision-language models (VLMs) such as CLIP for monocular depth estimation is a promising direction, yet existing approaches often require extensive fine-tuning or lack geometric precision. We present MoA-DepthCLIP, a parameter-efficient framework that adapts pretrained CLIP representations to monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone and combines it with selective fine-tuning of the final layers. This design enables spatially aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that combines depth-bin classification with direct regression. To improve structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results and significantly outperforms the DepthCLIP baseline, improving the $\delta_1$ accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved with substantially fewer trainable parameters, demonstrating that a lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation.
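For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how $\delta_1$ accuracy and RMSE are commonly computed in depth estimation. The abstract does not define them, so the definitions here are the conventional ones and are an assumption: $\delta_1$ is the fraction of valid pixels with $\max(\hat{d}/d,\, d/\hat{d}) < 1.25$, and RMSE is the root mean squared error over valid pixels. The function names, the flat-list inputs, and the rule that a ground-truth depth of 0 marks an invalid pixel are illustrative choices, not taken from the paper.

```python
import math


def delta1_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels whose depth ratio is within `threshold`.

    Assumption: a pixel is valid when its ground-truth depth is > 0.
    A non-positive prediction on a valid pixel is counted as a miss.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    hits = sum(
        1 for p, g in valid
        if p > 0 and max(p / g, g / p) < threshold
    )
    return hits / len(valid)


def rmse(pred, gt):
    """Root mean squared error over valid (ground truth > 0) pixels."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return math.sqrt(sum((p - g) ** 2 for p, g in valid) / len(valid))


# Toy depths in metres; the last pixel has no ground truth and is skipped.
pred_depth = [1.0, 2.0, 3.0, 4.0]
gt_depth = [1.1, 2.0, 4.0, 0.0]

d1 = delta1_accuracy(pred_depth, gt_depth)
err = rmse(pred_depth, gt_depth)
print(f"delta_1 = {d1:.3f}, RMSE = {err:.3f}")
```

Higher $\delta_1$ and lower RMSE both indicate better predictions, which is the direction of the improvement reported over DepthCLIP.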