Bird's-Eye-View (BEV) semantic maps have become an essential component of automated driving pipelines due to the rich representation they provide for decision-making tasks. However, existing approaches for generating these maps still follow a fully supervised training paradigm and hence rely on large amounts of annotated BEV data. In this work, we address this limitation by proposing the first self-supervised approach for generating a BEV semantic map from a single monocular image in the frontal view (FV). During training, we overcome the need for BEV ground truth annotations by leveraging the more easily available FV semantic annotations of video sequences. To this end, we propose the SkyEye architecture, which learns from two modes of self-supervision, namely, implicit supervision and explicit supervision. Implicit supervision trains the model by enforcing spatial consistency of the scene over time based on FV semantic sequences, while explicit supervision exploits BEV pseudolabels generated from FV semantic annotations and self-supervised depth estimates. Extensive evaluations on the KITTI-360 dataset demonstrate that our self-supervised approach performs on par with state-of-the-art fully supervised methods and achieves competitive results with only 1% of the direct BEV supervision used by fully supervised approaches. Finally, we publicly release both our code and the BEV datasets generated from the KITTI-360 and Waymo datasets.
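To make the idea behind explicit supervision more concrete, the following is a minimal sketch of how FV semantic labels and per-pixel depth could be lifted into a coarse BEV pseudolabel grid. It is not the SkyEye implementation: it assumes a pinhole camera with a horizontal focal length `fx` and principal point `cx`, a flat grid indexed by forward distance and lateral offset, and a simple rule for pixels that land in the same cell. The function name `fv_to_bev_pseudolabel` and all of its parameters are hypothetical.

```python
import math


def fv_to_bev_pseudolabel(sem, depth, fx, cx, bev_size=(8, 8), cell=1.0, ignore=255):
    """Scatter FV semantic labels into a BEV grid using per-pixel depth.

    sem, depth: H x W nested lists (class id and depth in metres per pixel).
    fx, cx: assumed pinhole intrinsics (focal length and principal point, in pixels).
    Returns a rows x cols grid; row = forward distance, col = lateral offset,
    with the camera centred on the middle column. Unobserved cells hold `ignore`.
    """
    rows, cols = bev_size
    bev = [[ignore] * cols for _ in range(rows)]
    for sem_row, depth_row in zip(sem, depth):
        for u, (label, z) in enumerate(zip(sem_row, depth_row)):
            if label == ignore or z <= 0:
                continue  # skip unlabeled pixels and invalid depth
            x = (u - cx) * z / fx  # lateral offset in metres (pinhole model)
            r = int(z // cell)  # forward cell index
            c = math.floor(x / cell) + cols // 2  # lateral cell index
            if 0 <= r < rows and 0 <= c < cols:
                # Assumption: later (lower) image rows overwrite earlier ones.
                bev[r][c] = label
    return bev


# Toy example: one image row, three pixels with different depths.
sem = [[1, 2, 3]]
depth = [[2.0, 3.0, 1.0]]
bev = fv_to_bev_pseudolabel(sem, depth, fx=1.0, cx=1.0)
```

In this sketch, cells that no pixel projects into keep the `ignore` value, which reflects that pseudolabels obtained this way are sparse and only cover regions visible from the frontal view.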