Vision-based 3D semantic occupancy perception is a key technology for robotics, including autonomous vehicles, because it provides a richer, fully three-dimensional understanding of the environment. This approach, however, typically requires more computational resources than BEV or 2D methods. We propose a novel 3D semantic occupancy perception method, OccupancyDETR, which combines a DETR-like object detector with a mixed dense-sparse 3D occupancy decoder. Our approach distinguishes between foreground and background within a scene. First, foreground objects are detected by the DETR-like object detector. Next, queries for both foreground and background objects are fed into the mixed dense-sparse 3D occupancy decoder, which upsamples them using dense and sparse methods, respectively. Finally, a MaskFormer is used to infer the semantics of the background voxels. Our approach strikes a balance between efficiency and accuracy, achieving faster inference, lower resource consumption, and improved performance on small objects. We demonstrate the effectiveness of the proposed method on the SemanticKITTI dataset, where it reaches an mIoU of 14 at a processing speed of 10 FPS, making it a promising solution for real-time 3D semantic occupancy perception.
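For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean intersection-over-union (mIoU) can be computed over flattened voxel labels. It is not the authors' evaluation code and does not reproduce the SemanticKITTI benchmark protocol. The function name `mean_iou`, the `ignore_label` parameter, and the toy labels are assumptions made for illustration. The sketch returns a fraction in [0, 1], whereas benchmark tables usually report mIoU as a percentage.

```python
from typing import Optional, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int,
             ignore_label: Optional[int] = None) -> float:
    """Mean IoU over flattened voxel labels (illustrative helper, not a benchmark API)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of voxels")

    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes

    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # skip voxels excluded from evaluation
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # classes absent from both prediction and target are skipped
            ious.append(intersection[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: 6 voxels, 3 semantic classes (made-up labels).
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Averaging per class rather than per voxel means small, rare classes weigh as much as large ones, which is why gains on small objects show up in mIoU.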