Dense Bird's Eye View (BEV) semantic maps are central to autonomous driving, yet current multi-camera methods depend on costly, inconsistently annotated BEV ground truth. We address this limitation with a two-phase training strategy for fine-grained road marking segmentation that removes full supervision during pretraining and halves the amount of training data during fine-tuning, while still outperforming the comparable supervised baseline. During self-supervised pretraining, BEVFormer predictions are differentiably reprojected into the image plane and trained against multi-view semantic pseudo-labels generated by the widely used semantic segmentation model Mask2Former. A temporal loss encourages consistency across frames. The subsequent supervised fine-tuning phase requires only 50% of the dataset and significantly less training time. With our method, fine-tuning benefits from the rich priors learned during pretraining, improving BEV segmentation quality by up to +2.5 pp mIoU over the fully supervised baseline on nuScenes. At the same time, it halves the amount of annotated data used and reduces total training time by up to two thirds. The results demonstrate that differentiable reprojection combined with camera-perspective pseudo-labels yields transferable BEV features and offers a scalable path toward reduced-label autonomous perception.
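To make the reprojection step concrete, the following is a minimal sketch (not the authors' implementation) of how BEV predictions on a ground-plane grid can be differentiably supervised by camera-view pseudo-labels: BEV cell centers at z = 0 are projected into one camera via intrinsics and extrinsics, the Mask2Former pseudo-label is gathered at each projected pixel, and a cross-entropy loss is applied to the logits of the visible cells. All tensor names, shapes, and the loss choice are illustrative assumptions.

```python
# Hedged sketch of a differentiable BEV-to-image reprojection loss.
# Assumptions: single camera, flat ground plane (z = 0), per-cell class logits.
import torch
import torch.nn.functional as F

def reprojection_loss(bev_logits, pseudo_label, K, T_cam_from_bev, grid_xy, img_hw):
    """
    bev_logits:      (C, H_bev, W_bev) class logits on the BEV grid
    pseudo_label:    (H_img, W_img) int64 pseudo-labels from Mask2Former
    K:               (3, 3) camera intrinsics
    T_cam_from_bev:  (4, 4) extrinsics mapping the BEV/ego frame to the camera frame
    grid_xy:         (H_bev * W_bev, 2) metric x, y coordinates of BEV cell centers
    img_hw:          (H_img, W_img) image size in pixels
    """
    C, H_bev, W_bev = bev_logits.shape
    n = grid_xy.shape[0]
    # Homogeneous ground-plane points (z = 0).
    pts = torch.cat([grid_xy, grid_xy.new_zeros((n, 1)), grid_xy.new_ones((n, 1))], dim=1)
    cam = (T_cam_from_bev @ pts.T)[:3]            # (3, N) points in the camera frame
    uvw = K @ cam                                 # (3, N) homogeneous pixel coordinates
    z = uvw[2]
    u = uvw[0] / z.clamp(min=1e-6)
    v = uvw[1] / z.clamp(min=1e-6)
    H_img, W_img = img_hw
    # Keep only cells that project in front of the camera and inside the image.
    valid = (z > 0) & (u >= 0) & (u < W_img) & (v >= 0) & (v < H_img)
    if valid.sum() == 0:
        return bev_logits.sum() * 0.0             # keep the graph connected
    targets = pseudo_label[v[valid].long(), u[valid].long()]   # (M,) pseudo-label classes
    logits = bev_logits.reshape(C, -1)[:, valid].T              # (M, C) visible BEV logits
    # Gradients flow into bev_logits only; the projection uses fixed camera geometry.
    return F.cross_entropy(logits, targets)
```

In a multi-camera setup such as nuScenes, this loss would be evaluated per camera and averaged, so each BEV cell receives supervision from every view in which it is visible.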