Bird's-eye view (BEV) maps are an important, geometrically structured representation widely used in robotics, particularly for self-driving vehicles and terrestrial robots. Existing algorithms either require depth information for the geometric projection, which is not always reliably available, or are trained end-to-end in a fully supervised way to map first-person visual observations to a BEV representation, and are therefore restricted to the output modality they were trained for. In contrast, we propose a new model capable of performing zero-shot projections of any modality available in a first-person view to the corresponding BEV map. This is achieved by disentangling the geometric inverse perspective projection from the modality transformation, e.g., RGB to occupancy. The method is general, and we showcase experiments projecting three different modalities to BEV: semantic segmentation, motion vectors, and object bounding boxes detected in the first-person view. We show experimentally that the model outperforms competing methods, in particular the widely used baseline that relies on monocular depth estimation.
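For context, the depth-based geometric projection that the abstract contrasts against can be sketched as follows. This is a minimal illustration of the baseline idea only, not the proposed model: it assumes a pinhole camera with a horizontal optical axis, a metric depth map, and integer per-pixel labels. The function name `project_to_bev` and the parameters `fx`, `cx`, `cell_size`, and `grid_size` are hypothetical names chosen for this sketch, and the vertical (height) coordinate is simply ignored.

```python
import math


def project_to_bev(depth, values, fx, cx, cell_size=0.1, grid_size=64, empty=-1):
    """Scatter per-pixel values into a top-down grid using a depth map.

    depth:  H x W nested list of metric depth along the optical axis (<= 0 means invalid).
    values: H x W nested list of labels from any first-person modality.
    Returns a grid_size x grid_size nested list. Row index grows with forward
    distance, and the camera sits at the middle column. Unhit cells keep `empty`.
    """
    bev = [[empty] * grid_size for _ in range(grid_size)]
    for depth_row, value_row in zip(depth, values):
        for u, (z, value) in enumerate(zip(depth_row, value_row)):
            if z <= 0:
                continue  # skip pixels without a valid depth estimate
            x = (u - cx) * z / fx  # lateral offset from the pinhole model
            row = math.floor(z / cell_size)
            col = math.floor(x / cell_size) + grid_size // 2
            if 0 <= row < grid_size and 0 <= col < grid_size:
                bev[row][col] = value  # last write wins when pixels collide
    return bev


# Toy input: a 4 x 4 image with every pixel 1 m away, labelled by its column.
depth = [[1.0] * 4 for _ in range(4)]
values = [[1, 2, 3, 4] for _ in range(4)]
bev = project_to_bev(depth, values, fx=4.0, cx=2.0, cell_size=0.25, grid_size=8)
```

Because the geometry here depends entirely on `depth`, any error in a monocular depth estimate moves labels to the wrong BEV cell, which is the limitation the abstract attributes to this family of methods.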