SPACE-CLIP: Spatial Perception via Adaptive CLIP Embeddings for Monocular Depth Estimation

Contrastive Language-Image Pre-training (CLIP) has achieved remarkable success in semantic understanding but inherently struggles to perceive geometric structure. Existing methods attempt to bridge this gap by querying CLIP with textual prompts, a process that is often indirect and inefficient. This paper takes a different approach based on a dual-pathway decoder. We present SPACE-CLIP, an architecture that unlocks and interprets latent geometric knowledge directly from a frozen CLIP vision encoder, bypassing the text encoder and its textual prompts entirely. A semantic pathway interprets high-level features, dynamically conditioned on global context through feature-wise linear modulation (FiLM), while a structural pathway extracts fine-grained spatial details from early layers. These complementary streams are fused hierarchically, combining semantic context with precise geometry. Extensive experiments on the KITTI benchmark show that SPACE-CLIP substantially outperforms previous CLIP-based methods, and our ablation studies confirm that fusing the two pathways is critical to this result. SPACE-CLIP offers a new, efficient blueprint for repurposing large-scale vision models. The proposed method is not only a standalone depth estimator but also a readily integrable spatial perception module for next-generation embodied AI systems, such as vision-language-action (VLA) models. Our model is available at https://github.com/taewan2002/space-clip
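To make the FiLM conditioning mentioned above concrete, the following is a minimal NumPy sketch of feature-wise linear modulation in its general form: a global context vector is mapped to a per-channel scale and shift, which are then applied to a feature map. It is not taken from the SPACE-CLIP code. The function name `film`, the linear projections, the tensor shapes, and the idea that the context is a single vector are all assumptions made for illustration.

```python
import numpy as np


def film(features, context, w_gamma, b_gamma, w_beta, b_beta):
    """Apply feature-wise linear modulation (FiLM) to one feature map.

    features: (C, H, W) feature map (hypothetical semantic-pathway features)
    context:  (D,) global context vector (assumed to be a single vector)
    w_gamma, w_beta: (C, D) projection weights (hypothetical parameters)
    b_gamma, b_beta: (C,) projection biases (hypothetical parameters)
    """
    # Predict one scale and one shift per channel from the global context.
    gamma = w_gamma @ context + b_gamma  # shape (C,)
    beta = w_beta @ context + b_beta     # shape (C,)
    # Broadcast over the spatial dimensions: every pixel in a channel
    # receives the same affine transform.
    return gamma[:, None, None] * features + beta[:, None, None]


# Toy dimensions chosen only for the example.
C, H, W, D = 8, 4, 4, 16
rng = np.random.default_rng(0)
features = rng.standard_normal((C, H, W))
context = rng.standard_normal(D)

w_gamma = 0.1 * rng.standard_normal((C, D))
b_gamma = np.ones(C)   # bias of 1 keeps the scale near identity
w_beta = 0.1 * rng.standard_normal((C, D))
b_beta = np.zeros(C)

modulated = film(features, context, w_gamma, b_gamma, w_beta, b_beta)
print(modulated.shape)
```

With zero projection weights, a scale bias of one, and a shift bias of zero, the function reduces to the identity, which is a convenient sanity check. The hierarchical fusion with the structural pathway is not shown here, since the abstract does not specify its form.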
