Occupancy prediction infers both the geometry and the semantic label of each voxel, making it an important perception task. However, it remains a semantic segmentation task and does not distinguish individual instances. Furthermore, although existing works such as Open-Vocabulary Occupancy (OVO) have addressed open-vocabulary detection, visual grounding in occupancy has, to the best of our knowledge, not yet been solved. To tackle these two limitations, this paper proposes Occupancy Grounding (OG), a novel method that equips vanilla occupancy prediction with instance segmentation ability and performs visual grounding at the voxel level with the help of Grounded-SAM. The keys to our approach are (1) affinity field prediction for instance clustering and (2) an association strategy for aligning 2D instance masks with 3D occupancy instances. We conduct extensive experiments, and their visualization results and analysis are shown below. Our code will be publicly released soon.
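To make the two components named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the affinity field can be read as a score between neighbouring occupied voxels, that instances are formed by thresholding those scores and merging with union-find, and that a 3D instance is matched to the 2D mask covering the largest share of its projected pixels. The names `cluster_by_affinity`, `associate`, `project`, and both thresholds are hypothetical and chosen for illustration only.

```python
from collections import defaultdict

# Positive 6-neighbourhood offsets; the negative directions are covered
# when the neighbouring voxel itself is visited.
OFFSETS = ((1, 0, 0), (0, 1, 0), (0, 0, 1))


def cluster_by_affinity(voxels, affinity, threshold=0.5):
    """Group occupied voxels into instances with union-find.

    voxels:   iterable of (x, y, z) integer coordinates of occupied voxels.
    affinity: dict mapping (voxel, neighbour) -> score in [0, 1]; a
              hypothetical stand-in for a predicted affinity field.
    """
    voxels = set(voxels)
    parent = {v: v for v in voxels}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for v in voxels:
        for dx, dy, dz in OFFSETS:
            n = (v[0] + dx, v[1] + dy, v[2] + dz)
            # Merge two neighbours only if their affinity is high enough.
            if n in voxels and affinity.get((v, n), 0.0) >= threshold:
                parent[find(v)] = find(n)

    groups = defaultdict(set)
    for v in voxels:
        groups[find(v)].add(v)
    # Sort by smallest coordinate so the output order is deterministic.
    return sorted(groups.values(), key=min)


def associate(instances, masks, project, min_overlap=0.5):
    """Match each 3D instance to the 2D mask covering most of its projection.

    instances: list of voxel sets.
    masks:     dict mask_id -> set of (u, v) pixels (assumed mask format).
    project:   callable mapping a voxel to a pixel (u, v), or None if the
               voxel falls outside the image (assumed camera model).
    Returns {instance_index: mask_id or None}.
    """
    matches = {}
    for i, inst in enumerate(instances):
        pixels = {p for p in map(project, inst) if p is not None}
        best_id, best_score = None, 0.0
        for mask_id, mask in masks.items():
            score = len(pixels & mask) / len(pixels) if pixels else 0.0
            if score > best_score:
                best_id, best_score = mask_id, score
        matches[i] = best_id if best_score >= min_overlap else None
    return matches
```

In this sketch, a text-prompted 2D mask (for example, one produced by Grounded-SAM) would be passed in as an entry of `masks`, and the voxels of the matched instance would form the grounded region. A real pipeline would replace the dictionary-based affinity lookup and the `project` callable with network outputs and calibrated camera projection.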