We introduce SLCF-Net, a novel approach for the Semantic Scene Completion (SSC) task that sequentially fuses LiDAR and camera data. It jointly estimates missing geometry and semantics in a scene from sequences of RGB images and sparse LiDAR measurements. The images are semantically segmented by a pre-trained 2D U-Net, and a dense depth prior is estimated by a depth-conditioned pipeline built on Depth Anything. To associate the 2D image features with the 3D scene volume, we introduce Gaussian-decay Depth-prior Projection (GDP). This module projects the 2D features into the 3D volume along the line of sight, weighting them with a Gaussian-decay function centered on the depth prior. Volumetric semantics are computed by a 3D U-Net. We propagate the hidden state of the 3D U-Net using the sensor motion and design a novel loss to ensure temporal consistency. We evaluate our approach on the SemanticKITTI dataset and compare it with leading SSC approaches. SLCF-Net excels in all SSC metrics and shows strong temporal consistency.
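The Gaussian-decay weighting behind GDP can be illustrated with a minimal sketch for a single camera ray. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the unnormalized Gaussian form, the `sigma` parameter, the sampling of voxel depths along the ray, and the function names are all hypothetical and not taken from the SLCF-Net implementation.

```python
import math


def gaussian_decay_weights(depth_prior, voxel_depths, sigma=1.0):
    """Return one weight per voxel along a ray, peaking at the depth prior.

    Assumed form (not confirmed by the source): an unnormalized Gaussian
    exp(-(z - d)^2 / (2 * sigma^2)), where z is the voxel depth and d is
    the depth prior for the pixel.
    """
    return [
        math.exp(-((z - depth_prior) ** 2) / (2.0 * sigma ** 2))
        for z in voxel_depths
    ]


def project_feature_along_ray(feature, depth_prior, voxel_depths, sigma=1.0):
    """Scale a 2D feature vector by the decay weight at each voxel depth."""
    weights = gaussian_decay_weights(depth_prior, voxel_depths, sigma)
    return [[w * channel for channel in feature] for w in weights]


# Hypothetical ray: ten voxels sampled every 0.5 m, from 0.5 m to 5.0 m.
voxel_depths = [0.5 * i for i in range(1, 11)]
feature = [0.2, 0.8]  # hypothetical two-channel feature for one pixel

ray_features = project_feature_along_ray(
    feature, depth_prior=3.0, voxel_depths=voxel_depths, sigma=0.5
)

# Index of the voxel that receives the strongest copy of the feature.
peak_index = max(range(len(voxel_depths)), key=lambda i: ray_features[i][1])
```

In this sketch the voxel at the depth prior receives the feature at full strength, and voxels in front of or behind it receive progressively attenuated copies. A smaller `sigma` concentrates the feature near the estimated surface, while a larger one spreads it further along the line of sight.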