3D semantic occupancy prediction provides a fine-grained representation of the surrounding environment, which is essential for safety in autonomous driving. Existing fusion-based occupancy methods typically apply a 2D-to-3D view transformation to image features and then fuse the result with LiDAR features through computationally intensive 3D operations, which incurs high computational cost and reduces accuracy. Moreover, current research on occupancy prediction focuses predominantly on designing network architectures tailored to particular models, with limited attention to the more fundamental problem of semantic feature learning. This gap hinders the development of transferable methods that could improve a range of occupancy models. To address these challenges, we propose OccLoff, a framework that Learns to Optimize Feature Fusion for 3D occupancy prediction. Specifically, we introduce a sparse fusion encoder with entropy masks that directly fuses 3D and 2D features, improving accuracy while reducing computational overhead. In addition, we propose a transferable proxy-based loss function and an adaptive hard sample weighting algorithm, which enhance the performance of several state-of-the-art methods. Extensive evaluations on the nuScenes and SemanticKITTI benchmarks demonstrate the superiority of our framework, and ablation studies confirm the effectiveness of each proposed module.
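To make the idea of an entropy mask concrete, the following is a minimal sketch and not the OccLoff implementation. It assumes that each voxel carries a predicted class distribution and that the mask keeps the most uncertain voxels (highest Shannon entropy) for fusion; the abstract does not state how the masks are computed or which voxels they select. The names `shannon_entropy`, `entropy_mask`, and `keep_ratio`, as well as the sample probabilities, are hypothetical.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_mask(voxel_probs: Sequence[Sequence[float]],
                 keep_ratio: float) -> List[bool]:
    """Mark the keep_ratio fraction of voxels with the highest entropy.

    Assumption: high-entropy (uncertain) voxels are the ones selected;
    the source abstract does not specify the selection rule.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    entropies = [shannon_entropy(p) for p in voxel_probs]
    n_keep = max(1, math.ceil(keep_ratio * len(entropies)))
    # Indices sorted by descending entropy; ties keep their original order.
    ranked = sorted(range(len(entropies)), key=lambda i: -entropies[i])
    selected = set(ranked[:n_keep])
    return [i in selected for i in range(len(entropies))]


# Hypothetical per-voxel class probabilities over three classes.
voxel_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction, high entropy
    [0.70, 0.20, 0.10],  # moderately uncertain
    [1.00, 0.00, 0.00],  # fully confident, zero entropy
]
mask = entropy_mask(voxel_probs, keep_ratio=0.5)
print(mask)  # [False, True, True, False]
```

In a sparse encoder, a mask of this kind would restrict the expensive fusion step to the selected voxels, which is one plausible way to reduce computation while concentrating capacity where the prediction is least certain.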