Reconstructing dense 3D scenes from sparse LiDAR point clouds is a fundamental challenge in autonomous driving, and latent diffusion models offer a promising solution. However, existing approaches rely on object-level autoencoders that collapse into unstable global representations at outdoor scale, and they train on ground truth data corrupted by odometry drift, which systematically degrades supervision quality. Furthermore, multi-step diffusion inference incurs prohibitive latency for real-time deployment. We propose a novel multi-token Gaussian VAE with cross-attention pooling for stable scene-scale LiDAR compression, combined with an anchor-based ICP ground truth refinement pipeline that eliminates drift-induced noise from training supervision. Together, these components enable a scaffold-free single-step diffusion completion model that achieves an approximately 16x reduction in squared Chamfer distance on SemanticKITTI seq. 08 (0.396 m^2 to 0.024 m^2), surpasses LiDiff and ScoreLiDAR by 17-19% and 10-11%, respectively, and operates at 25-143x lower inference latency. Our results demonstrate that data quality dominates model design in this regime and that multi-token latent spaces provide a stable first stage for latent diffusion-based scene completion.
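For readers unfamiliar with the headline metric, the following is a minimal sketch of a squared Chamfer distance between two point sets, together with the arithmetic behind the reported reduction factor. The abstract does not specify the exact convention used, so the symmetric form below (sum of the two directional means of squared nearest-neighbor distances) is an assumption, and the function name `squared_chamfer_distance` is hypothetical. The brute-force search is for illustration only; real LiDAR scans would need a spatial index.

```python
def squared_chamfer_distance(pred, gt):
    """Symmetric squared Chamfer distance between two lists of 3D points.

    Assumed convention (not confirmed by the abstract): the mean squared
    nearest-neighbor distance from pred to gt plus the same from gt to pred.
    """
    def sq_dist(p, q):
        # Squared Euclidean distance between two points.
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def mean_nearest(src, dst):
        # Average, over src, of the squared distance to the closest dst point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(pred, gt) + mean_nearest(gt, pred)


# Reduction factor implied by the two values quoted in the abstract (in m^2).
reduction_factor = 0.396 / 0.024

# Toy example: two points each, differing in one coordinate of one point.
toy_pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
toy_gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
toy_cd = squared_chamfer_distance(toy_pred, toy_gt)
```

Under this convention the toy pair yields 1.0, and the quoted values give a ratio of 16.5, consistent with the "approximately 16x" figure.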