Semantic occupancy prediction aims to infer the dense geometry and semantics of an autonomous agent's surroundings so that it can operate safely in a 3D environment. Existing occupancy prediction methods are trained almost entirely on human-annotated volumetric data. Although such 3D annotations are of high quality, producing them is laborious and costly, which restricts existing methods to the few object categories present in the training dataset. To address this limitation, this paper proposes Open Vocabulary Occupancy (OVO), a novel approach that allows semantic occupancy prediction of arbitrary classes without the need for 3D annotations during training. The keys to our approach are (1) knowledge distillation from a pre-trained 2D open-vocabulary segmentation model to the 3D occupancy network, and (2) pixel-voxel filtering for high-quality training data generation. The resulting framework is simple, compact, and compatible with most state-of-the-art semantic occupancy prediction models. On the NYUv2 and SemanticKITTI datasets, OVO achieves performance competitive with supervised semantic occupancy prediction approaches. Furthermore, we conduct extensive analyses and ablation studies to offer insights into the design of the proposed framework.
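As a rough illustration of the two ingredients named above (2D-to-3D distillation and pixel-voxel pairing), the following minimal Python sketch may help. The abstract does not specify the loss, the filtering criteria, or the camera model, so everything here is an assumption: a pinhole projection, a simple "projects inside the image" filter standing in for pixel-voxel filtering, a mean (1 - cosine similarity) distillation loss, and nearest-text-embedding classification. All function names and values are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def project(center, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to integer pixel coordinates."""
    x, y, z = center
    if z <= 0:  # point is behind the camera: no valid pixel
        return None
    return (int(round(fx * x / z + cx)), int(round(fy * y / z + cy)))


def pair_voxels_with_pixels(centers, intrinsics, width, height):
    """Keep only voxels whose centre projects inside the image.

    Assumption: a simple stand-in for pixel-voxel filtering; the actual
    criteria are not described in the abstract.
    """
    pairs = []
    for i, center in enumerate(centers):
        uv = project(center, *intrinsics)
        if uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height:
            pairs.append((i, uv))
    return pairs


def distillation_loss(voxel_feats, pixel_feats, pairs):
    """Mean (1 - cosine) between each voxel feature and its paired 2D feature."""
    if not pairs:
        return 0.0
    total = sum(1.0 - cosine(voxel_feats[i], pixel_feats[uv]) for i, uv in pairs)
    return total / len(pairs)


def predict_class(voxel_feat, text_embeddings):
    """Open-vocabulary label: the class whose text embedding is most similar."""
    return max(text_embeddings, key=lambda name: cosine(voxel_feat, text_embeddings[name]))


# Toy data (hypothetical): three voxel centres in the camera frame.
centers = [(0.0, 0.0, 2.0), (5.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
intrinsics = (100.0, 100.0, 32.0, 24.0)  # fx, fy, cx, cy
pairs = pair_voxels_with_pixels(centers, intrinsics, width=64, height=48)

# Features from the 3D network (per voxel) and the 2D model (per pixel).
voxel_feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pixel_feats = {(32, 24): [1.0, 0.0]}
loss = distillation_loss(voxel_feats, pixel_feats, pairs)

# At inference, any class name with a text embedding can be queried.
text_embeddings = {"chair": [0.9, 0.1], "table": [0.1, 0.9]}
label = predict_class(voxel_feats[0], text_embeddings)
```

In this toy setup only the first voxel survives the pairing step: the second projects outside the image and the third lies behind the camera. Because the class list is just a dictionary of text embeddings, new categories can be queried at inference time without retraining, which is the property the open-vocabulary setting relies on.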