Monocular semantic occupancy prediction aims to infer the complete 3D geometry and semantics of a scene from 2D images alone. It has attracted significant attention, particularly for its potential to enhance the 3D perception of autonomous vehicles. However, existing methods rely on a complex cascaded framework that restores 3D scenes from relatively limited information: supervision is applied only to the output of the whole network, the input is a single frame, and the image backbone is small. These limitations hinder optimization of the framework and yield inferior predictions, particularly for small and long-tailed objects. To address these issues, we propose MonoOcc. In particular, we (i) improve the monocular occupancy prediction framework by proposing an auxiliary semantic loss that supervises the shallow layers of the framework and an image-conditioned cross-attention module that refines voxel features with visual cues, and (ii) employ a distillation module that transfers temporal information and richer knowledge from a larger image backbone to the monocular semantic occupancy prediction framework at low hardware cost. With these improvements, our method achieves state-of-the-art performance on the camera-based SemanticKITTI Scene Completion benchmark. Code and models are available at https://github.com/ucaszyp/MonoOcc
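
As a rough illustration of how the supervision signals named in (i) and (ii) might be combined, the sketch below adds an auxiliary semantic loss on shallow-layer predictions and a feature-distillation term to the usual loss on the final output. This is a minimal sketch under stated assumptions, not the MonoOcc implementation: the function names, the weights `aux_weight` and `distill_weight`, the use of mean squared error for distillation, and the `ignore_index` value of 255 for unlabeled voxels are all hypothetical choices made for illustration.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one voxel; `logits` is a list of class scores."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def mean_ce(all_logits, labels, ignore_index=255):
    """Average cross-entropy over voxels, skipping voxels labeled `ignore_index`.

    The value 255 is an assumed marker for unlabeled voxels.
    """
    terms = [
        cross_entropy(logits, y)
        for logits, y in zip(all_logits, labels)
        if y != ignore_index
    ]
    return sum(terms) / len(terms) if terms else 0.0


def feature_mse(student, teacher):
    """Mean squared error between student and teacher feature vectors.

    MSE is an assumed stand-in for the distillation objective.
    """
    count = sum(len(row) for row in student)
    squared = sum(
        (s - t) ** 2
        for s_row, t_row in zip(student, teacher)
        for s, t in zip(s_row, t_row)
    )
    return squared / count


def total_loss(final_logits, shallow_logits, labels,
               student_feats, teacher_feats,
               aux_weight=0.5, distill_weight=1.0):
    """Weighted sum of three terms (the weights are hypothetical):

    - cross-entropy on the final network output,
    - auxiliary cross-entropy on shallow-layer predictions,
    - distillation from a larger (teacher) backbone's features.
    """
    main = mean_ce(final_logits, labels)
    aux = mean_ce(shallow_logits, labels)
    distill = feature_mse(student_feats, teacher_feats)
    return main + aux_weight * aux + distill_weight * distill
```

In this sketch the teacher features would come from a frozen, larger backbone used only during training, so that inference still runs the smaller monocular model alone; that split is one reading of the "low hardware cost" claim, not a detail confirmed by the abstract.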