Contrastive Language-Image Pre-training (CLIP) has achieved remarkable success in semantic understanding but inherently struggles to perceive geometric structure. Existing methods attempt to bridge this gap by querying CLIP with textual prompts, a process that is often indirect and inefficient. This paper introduces a fundamentally different approach based on a dual-pathway decoder. We present SPACE-CLIP, an architecture that unlocks and interprets latent geometric knowledge directly from a frozen CLIP vision encoder, completely bypassing the text encoder and its associated textual prompts. A semantic pathway interprets high-level features, dynamically conditioned on global context via feature-wise linear modulation (FiLM), while a structural pathway extracts fine-grained spatial details from early layers. These complementary streams are hierarchically fused, enabling a robust synthesis of semantic context and precise geometry. Extensive experiments on the KITTI benchmark show that SPACE-CLIP substantially outperforms previous CLIP-based methods, and our ablation studies confirm that the synergistic fusion of the two pathways is critical to this success. SPACE-CLIP offers a new, efficient, and architecturally elegant blueprint for repurposing large-scale vision models. The proposed method is not just a standalone depth estimator but a readily integrable spatial perception module for the next generation of embodied AI systems, such as vision-language-action (VLA) models. Our model is available at https://github.com/taewan2002/space-clip.
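To make the FiLM conditioning concrete, the sketch below shows the generic feature-wise linear modulation operation named in the abstract: a global context vector predicts a per-channel scale and shift that modulate a feature map. All shapes, the pooled-context source, and the linear heads are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation: y = gamma * x + beta, applied per channel.

    features: (C, H, W) feature map (e.g. from the semantic pathway).
    gamma, beta: (C,) scale and shift predicted from global context.
    """
    return gamma[:, None, None] * features + beta[:, None, None]

# Toy example: a 4-channel 2x2 feature map conditioned on a global vector
# (hypothetically, a pooled CLIP [CLS] embedding).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2, 2))
context = rng.standard_normal(8)            # assumed global-context vector
W_g, W_b = rng.standard_normal((2, 4, 8))   # illustrative linear heads
gamma, beta = W_g @ context, W_b @ context
y = film(x, gamma, beta)
assert y.shape == x.shape
```

With gamma fixed to ones and beta to zeros, FiLM reduces to the identity, which is why it can be inserted into a decoder without disrupting an already-useful feature map.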