Semantic occupancy has recently gained significant traction as a prominent method for 3D scene representation. However, most existing camera-based methods rely on costly datasets with fine-grained 3D voxel labels or LiDAR scans for training, which limits their practicality and scalability and raises the need for self-supervised approaches in this domain. Moreover, most methods are tied to a predefined set of classes that they can detect. In this work, we present \textit{LangOcc}, a novel approach for open vocabulary occupancy estimation that is trained only via camera images and can detect arbitrary semantics through vision-language alignment. In particular, we distill the knowledge of the strong vision-language-aligned encoder CLIP into a 3D occupancy model via differentiable volume rendering. Our model estimates vision-language-aligned features in a 3D voxel grid using only images. It is trained in a self-supervised manner by rendering the estimated features back to 2D space, where ground-truth features can be computed. This training mechanism automatically supervises the scene geometry, enabling a straightforward and powerful training scheme without any explicit geometry supervision. LangOcc outperforms LiDAR-supervised competitors in open vocabulary occupancy estimation by a large margin while relying solely on vision-based training. We also achieve state-of-the-art results in self-supervised semantic occupancy estimation on the Occ3D-nuScenes dataset, despite not being limited to a specific set of categories, demonstrating the effectiveness of our proposed vision-language training.
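The core mechanism the abstract describes (rendering estimated voxel features back to 2D so they can be compared against CLIP features) can be illustrated with the standard NeRF-style alpha-compositing quadrature along a single ray. The sketch below is a minimal, illustrative implementation of that general technique, not the paper's actual code; all function names, shapes, and variable names are assumptions.

```python
import numpy as np

def render_features(densities, features, deltas):
    """Alpha-composite per-sample feature vectors along one camera ray.

    densities: (N,) non-negative volume densities sigma_i at ray samples
    features:  (N, D) per-sample (e.g. vision-language) feature vectors f_i
    deltas:    (N,) distances between consecutive samples along the ray
    Returns the rendered 2D feature (D,) and the per-sample weights (N,).
    Illustrative sketch only; not the paper's implementation.
    """
    # Opacity of each ray segment: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), i.e. light surviving to sample i
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    # Compositing weights w_i = T_i * alpha_i
    weights = trans * alphas
    # Rendered feature is the weight-averaged sum of per-sample features
    rendered = (weights[:, None] * features).sum(axis=0)
    return rendered, weights
```

Because the weights depend on the predicted densities, a 2D feature loss against CLIP targets back-propagates through this quadrature into the geometry, which is how rendering-based supervision can train occupancy without explicit 3D geometry labels.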