Spatial intelligence in vision-language models (VLMs) has attracted research interest, driven by the practical demand to reason about the 3D world. Despite promising results, most existing methods follow the conventional 2D VLM pipeline and use pixel-aligned representations for the vision modality. Correspondence-based models, which rely on implicit 3D scene understanding, often fail to achieve spatial consistency, while representation-based models, which use 3D geometric priors, serialize vision sequences inefficiently. To address both issues, we propose Proxy3D, a method that uses compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then cluster these features in a semantic-aware manner to obtain a set of proxies in 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adapt the proposed 3D proxy representations to the VLM. While using shorter vision sequences, our method achieves competitive or state-of-the-art performance on 3D visual question answering, visual grounding, and general spatial intelligence benchmarks.
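The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, so everything here is an assumption made for illustration: the sketch uses plain k-means over joint vectors (3D position concatenated with a weighted semantic feature) with farthest-point initialization, and the names `Proxy`, `cluster_into_proxies`, and `semantic_weight` are hypothetical rather than taken from Proxy3D. It only shows how many per-point features could be merged into a short set of 3D proxies.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Proxy:
    position: List[float]  # mean 3D location of the cluster members
    feature: List[float]   # mean semantic feature of the cluster members
    count: int             # number of points merged into this proxy


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(rows: List[Sequence[float]]) -> List[float]:
    """Column-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def cluster_into_proxies(
    positions: Sequence[Sequence[float]],
    features: Sequence[Sequence[float]],
    num_proxies: int,
    semantic_weight: float = 1.0,  # hypothetical knob: semantics vs. geometry
    iters: int = 10,
) -> List[Proxy]:
    """Assumed stand-in for semantic-aware clustering: k-means on [xyz | feature]."""
    if len(positions) != len(features) or not 0 < num_proxies <= len(positions):
        raise ValueError("need matching inputs and 0 < num_proxies <= number of points")

    # Joint vector per point: 3D position followed by the weighted semantic feature.
    joint = [
        list(p) + [semantic_weight * f for f in feat]
        for p, feat in zip(positions, features)
    ]

    # Deterministic farthest-point initialization of the cluster centers.
    centers = [joint[0]]
    while len(centers) < num_proxies:
        centers.append(max(joint, key=lambda v: min(_sq_dist(v, c) for c in centers)))

    assign = [0] * len(joint)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [
            min(range(num_proxies), key=lambda k: _sq_dist(v, centers[k]))
            for v in joint
        ]
        # Update step: move each non-empty center to the mean of its members.
        for k in range(num_proxies):
            members = [v for v, a in zip(joint, assign) if a == k]
            if members:
                centers[k] = _mean(members)

    # One proxy per non-empty cluster, built from the unweighted inputs.
    proxies = []
    for k in range(num_proxies):
        idx = [i for i, a in enumerate(assign) if a == k]
        if idx:
            proxies.append(
                Proxy(
                    position=_mean([positions[i] for i in idx]),
                    feature=_mean([features[i] for i in idx]),
                    count=len(idx),
                )
            )
    return proxies
```

In this sketch the vision sequence passed to the VLM would contain at most `num_proxies` entries instead of one entry per pixel or patch, which is the efficiency argument the abstract makes. Raising `semantic_weight` makes clusters follow semantic similarity more closely than spatial proximity.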