3D occupancy prediction based on multi-sensor fusion, crucial for a reliable autonomous driving system, enables fine-grained understanding of 3D scenes. Previous fusion-based 3D occupancy prediction methods rely on depth estimation to process 2D image features. However, depth estimation is an ill-posed problem, which limits the accuracy and robustness of these methods. Furthermore, fine-grained occupancy prediction demands extensive computational resources. We introduce OccFusion, a multi-modal fusion method that requires no depth estimation, together with a corresponding point cloud sampling algorithm for dense integration of image features. Building on this, we propose an active training method and an active coarse-to-fine pipeline, which enable the model to adaptively learn more from complex samples and to optimize predictions specifically for challenging areas such as small or overlapping objects. The proposed active methods can be naturally extended to any occupancy prediction model. Experiments on the OpenOccupancy benchmark show that our method surpasses existing state-of-the-art (SOTA) multi-modal methods in IoU across all categories. Additionally, our model is more efficient in both the training and inference phases, requiring far fewer computational resources. Comprehensive ablation studies demonstrate the effectiveness of the proposed techniques.
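For readers unfamiliar with the reported metric, the per-category IoU can be sketched as follows. This is a minimal illustration of the standard intersection-over-union definition on voxel labels, not the OpenOccupancy evaluation code; the function name `per_class_iou`, the flattened-label input format, and the toy labels are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_iou(
    pred: Sequence[int], target: Sequence[int], num_classes: int
) -> Dict[int, float]:
    """Return IoU per class for two flattened voxel label sequences.

    Hypothetical helper: `pred` and `target` hold one integer class
    label per voxel, in the same voxel order.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of voxels")

    intersection: Counter = Counter()
    pred_count: Counter = Counter()
    target_count: Counter = Counter()

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious: Dict[int, float] = {}
    for c in range(num_classes):
        # |A ∪ B| = |A| + |B| - |A ∩ B|
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes absent from both prediction and target
            ious[c] = intersection[c] / union
    return ious


# Toy example with six voxels and four possible classes (assumed labels).
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 2, 1]
ious = per_class_iou(pred, target, num_classes=4)
```

Classes that appear in neither the prediction nor the ground truth are omitted from the result rather than scored as zero, which is one common convention; an evaluation script may handle this case differently.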