In autonomous driving applications, unlabeled LiDAR logs are a gold mine of dense 3D geometry hiding in plain sight, yet they are of little use without human labels, which remain a dominant cost barrier for autonomous-perception research. In this work we tackle this bottleneck by leveraging temporal-geometric consistency across LiDAR sweeps to lift and fuse cues from text and 2D vision foundation models directly into 3D, without any manual input. We introduce an unsupervised multi-modal pseudo-labeling method that relies on strong geometric priors learned from temporally accumulated LiDAR maps, together with a novel iterative update rule that enforces joint geometric-semantic consistency and, conversely, detects moving objects from the remaining inconsistencies. Our method simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans, and generalizes robustly across three datasets. We experimentally validate that our method compares favorably with existing semantic segmentation and object detection pseudo-labeling methods, which often require additional manual supervision. We further confirm that even a small fraction of our geometrically consistent, densified LiDAR reduces depth-prediction MAE by 51.5% and 22.0% in the 80-150 m and 150-250 m ranges, respectively.
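To make the lifting step concrete, the sketch below illustrates one generic way to fuse 2D semantic predictions into an accumulated LiDAR point set by projecting points into each camera and majority-voting per-point class labels across sweeps. This is a minimal illustration, not the paper's implementation: the function names, the intrinsics `K`, the extrinsics `T_cam_from_lidar`, and the voting scheme are all assumptions made for exposition.

```python
import numpy as np

def project_points(points_lidar, T_cam_from_lidar, K):
    """Project Nx3 LiDAR points into pixel coordinates of one camera.

    T_cam_from_lidar: 4x4 rigid transform (assumed), K: 3x3 intrinsics (assumed).
    Returns pixel coords for points in front of the camera and the in-front mask.
    """
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0.1                      # discard points behind the camera
    uv = (K @ pts_cam[in_front].T).T
    uv = uv[:, :2] / uv[:, 2:3]                         # perspective divide
    return uv, in_front

def fuse_semantic_votes(points_lidar, sweeps, num_classes):
    """Accumulate per-point class votes over multiple (semantic_map, pose, K) sweeps.

    sweeps: iterable of (sem_map, T_cam_from_lidar, K), where sem_map is an
    HxW integer array of 2D class predictions (e.g. from a vision foundation model).
    Returns a majority-vote pseudo-label per 3D point.
    """
    votes = np.zeros((len(points_lidar), num_classes), dtype=np.int64)
    for sem_map, T_cam_from_lidar, K in sweeps:
        uv, mask = project_points(points_lidar, T_cam_from_lidar, K)
        h, w = sem_map.shape
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)  # keep projections inside the image
        idx = np.flatnonzero(mask)[valid]
        votes[idx, sem_map[v[valid], u[valid]]] += 1
    return votes.argmax(axis=1)
```

Points whose votes stay inconsistent across sweeps are natural candidates for the moving-object detection described above, since static geometry should receive stable labels over time.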