Bird's-eye view (BEV) maps are an important, geometrically structured representation widely used in robotics, in particular for self-driving vehicles and terrestrial robots. Existing algorithms either require depth information for the geometric projection, which is not always reliably available, or are trained end-to-end in a fully supervised way to map first-person visual observations to a BEV representation, and are therefore restricted to the output modality they were trained on. In contrast, we propose a new model capable of performing zero-shot projections of any modality available in a first-person view to the corresponding BEV map. This is achieved by disentangling the geometric inverse perspective projection from the modality transformation, e.g., RGB to occupancy. The method is general, and we showcase experiments projecting three different modalities to BEV: semantic segmentation, motion vectors, and object bounding boxes detected in the first-person view. We show experimentally that the model outperforms competing methods, in particular the widely used baseline that relies on monocular depth estimation.
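For readers unfamiliar with the depth-based baseline mentioned above, the following is a minimal sketch of a purely geometric projection, not the proposed model. It assumes a pinhole camera with known horizontal intrinsics and a per-pixel metric depth map (for instance from a monocular depth estimator); the function name `project_to_bev` and all of its parameters are hypothetical and chosen for illustration only.

```python
import numpy as np

def project_to_bev(labels, depth, fx, cx, cell_size=0.1, bev_size=64):
    """Scatter per-pixel values of a first-person image into a top-down grid.

    labels: (H, W) integer array holding any per-pixel modality (0 = empty).
    depth:  (H, W) float array, metric depth along the optical axis.
    fx, cx: horizontal focal length and principal point, in pixels.
    Assumption: the camera sits at the bottom-centre of the grid and looks
    towards row 0; pixel height is ignored, so everything is flattened
    onto the ground plane.
    """
    h, w = labels.shape
    u = np.tile(np.arange(w), (h, 1))      # pixel column index per pixel
    x = (u - cx) * depth / fx              # lateral offset in metres
    z = depth                              # forward distance in metres

    # Convert metric coordinates to grid indices.
    col = np.floor(x / cell_size).astype(int) + bev_size // 2
    row = bev_size - 1 - np.floor(z / cell_size).astype(int)

    # Keep only pixels with valid depth that fall inside the grid.
    valid = (
        (depth > 0)
        & (col >= 0) & (col < bev_size)
        & (row >= 0) & (row < bev_size)
    )

    bev = np.zeros((bev_size, bev_size), dtype=labels.dtype)
    # If several pixels land in the same cell, the last one written wins.
    bev[row[valid], col[valid]] = labels[valid]
    return bev
```

Because the projection depends only on `depth` and the intrinsics, the same function can be applied to any per-pixel modality, but its output is only as good as the depth estimate, which is the limitation the abstract points to.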