We introduce OccFeat, a self-supervised pretraining method for camera-only Bird's-Eye-View (BEV) segmentation networks. With OccFeat, we pretrain a BEV network via occupancy prediction and feature distillation tasks. Occupancy prediction gives the model a 3D geometric understanding of the scene, but the geometry it learns is class-agnostic. We therefore add semantic information to the model in 3D space by distilling features from a self-supervised pretrained image foundation model. Models pretrained with our method exhibit improved BEV semantic segmentation performance, particularly in low-data scenarios. Moreover, empirical results confirm the benefit of combining feature distillation with 3D occupancy prediction in our pretraining approach.
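To make the two pretraining tasks concrete, the following is a minimal sketch of how an occupancy term and a feature distillation term could be combined into one objective. The abstract does not specify the loss forms, so everything here is an assumption made for illustration: binary cross-entropy for occupancy, cosine distance for distillation, restricting distillation to occupied voxels, the weight `lam`, and all function names. Voxels are flattened into plain Python lists to keep the example dependency-free.

```python
import math

def occupancy_loss(logits, targets):
    """Mean binary cross-entropy over voxels (assumed loss form).

    logits: one float per voxel; targets: 1 if the voxel is occupied, else 0.
    """
    total = 0.0
    for x, t in zip(logits, targets):
        # Numerically stable BCE with logits.
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

def cosine_distance(u, v, eps=1e-8):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)

def distillation_loss(student, teacher, targets):
    """Mean cosine distance between student and teacher features.

    Assumption: only occupied voxels contribute, since empty space
    has no image feature to match.
    """
    dists = [cosine_distance(s, t)
             for s, t, occ in zip(student, teacher, targets) if occ == 1]
    return sum(dists) / len(dists) if dists else 0.0

def pretraining_loss(logits, targets, student, teacher, lam=1.0):
    """Occupancy term plus a weighted distillation term (lam is hypothetical)."""
    return (occupancy_loss(logits, targets)
            + lam * distillation_loss(student, teacher, targets))

# Toy example: three voxels, two of them occupied, 2-D features.
targets = [1, 0, 1]
logits = [2.0, -1.5, 0.3]
teacher = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
student = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.9]]
print(pretraining_loss(logits, targets, student, teacher, lam=0.5))
```

In this sketch the occupancy term supplies the class-agnostic geometric signal, while the distillation term pulls the features predicted at occupied voxels toward those of the image foundation model, which is where the semantic information would enter.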