Multi-Resolution Alignment for Voxel Sparsity in Camera-Based 3D Semantic Scene Completion

Camera-based 3D semantic scene completion (SSC) offers a cost-effective way to estimate the geometric occupancy and semantic label of each voxel in the surrounding 3D scene from image inputs, providing a voxel-level perception foundation for perception-prediction-planning autonomous driving systems. Although existing methods have made significant progress, their optimization relies solely on supervision from voxel labels and suffers from voxel sparsity: a large portion of voxels in autonomous driving scenarios are empty, which limits both optimization efficiency and model performance. To address this issue, we propose a \textit{Multi-Resolution Alignment (MRA)} approach that mitigates voxel sparsity in camera-based 3D semantic scene completion by exploiting scene-level and instance-level alignment across multi-resolution 3D features as auxiliary supervision. Specifically, we first propose the Multi-resolution View Transformer module, which projects 2D image features into multi-resolution 3D features and aligns them at the scene level by fusing discriminative seed features. Furthermore, we design the Cubic Semantic Anisotropy module to identify the instance-level semantic significance of each voxel by measuring its semantic differences from neighboring voxels within a cubic area. Finally, we devise a Critical Distribution Alignment module, which selects critical voxels as instance-level anchors under the guidance of cubic semantic anisotropy and applies a circulated loss as auxiliary supervision on the consistency of critical feature distributions across resolutions. The code is available at https://github.com/PKU-ICST-MIPL/MRA_TIP.
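
The abstract does not give a formula for cubic semantic anisotropy, so the following is only a minimal sketch of the idea under an assumed definition: the score of a voxel is the fraction of its neighbors in a (2r+1)^3 cube whose semantic label differs from its own. The function names `cubic_semantic_anisotropy` and `select_critical_voxels`, the `radius` and `k` parameters, and the edge-replication border handling are all hypothetical choices made for illustration, not the authors' implementation.

```python
import itertools

import numpy as np


def cubic_semantic_anisotropy(labels: np.ndarray, radius: int = 1) -> np.ndarray:
    """Assumed definition: fraction of cube neighbors whose label differs from the center."""
    x, y, z = labels.shape
    # Replicate border labels so every voxel has a full cube of neighbors (an assumption).
    padded = np.pad(labels, radius, mode="edge")
    diff_count = np.zeros(labels.shape, dtype=np.float64)
    offsets = range(-radius, radius + 1)
    for dx, dy, dz in itertools.product(offsets, repeat=3):
        if dx == dy == dz == 0:
            continue  # skip the center voxel itself
        shifted = padded[
            radius + dx : radius + dx + x,
            radius + dy : radius + dy + y,
            radius + dz : radius + dz + z,
        ]
        diff_count += shifted != labels
    n_neighbors = (2 * radius + 1) ** 3 - 1
    return diff_count / n_neighbors


def select_critical_voxels(score: np.ndarray, k: int) -> np.ndarray:
    """Return the (k, 3) indices of the k highest-scoring voxels (hypothetical selection rule)."""
    flat = np.argsort(score, axis=None)[::-1][:k]
    return np.stack(np.unravel_index(flat, score.shape), axis=1)


# Toy scene: a mostly empty 4x4x4 grid (label 0) with one occupied voxel (label 1).
labels = np.zeros((4, 4, 4), dtype=np.int64)
labels[1, 1, 1] = 1
score = cubic_semantic_anisotropy(labels)
critical = select_critical_voxels(score, k=1)
```

In this toy grid the isolated occupied voxel differs from all 26 of its neighbors, so it receives the highest score and is the one selected, while empty voxels far from it score zero. This mirrors the motivation stated above: in a sparse scene, voxels at semantic boundaries carry more signal than the large empty regions.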
