3D perception is a critical problem in autonomous driving. Recently, the Bird's-Eye-View (BEV) approach has attracted extensive attention, owing to its low deployment cost and strong vision-based detection capability. However, existing models ignore a realistic scenario during driving: one or more cameras may fail, which substantially degrades performance. To tackle this problem, we propose a generic Masked BEV (M-BEV) perception framework, which effectively improves robustness to this challenging scenario by randomly masking and reconstructing camera views during end-to-end training. More specifically, we develop a novel Masked View Reconstruction (MVR) module for M-BEV. It mimics various missing-view cases by randomly masking the features of different camera views, then uses the original features of these views as self-supervision and reconstructs the masked ones from the distinct spatio-temporal context across views. With this plug-and-play MVR, our M-BEV learns to recover the missing views from the remaining ones, and thus generalizes well to robust view recovery and accurate perception at test time. We perform extensive experiments on the popular nuScenes benchmark, where our framework significantly boosts the 3D perception performance of state-of-the-art models in various missing-view cases; e.g., when the back view is absent, our M-BEV improves the PETRv2 model by 10.3% mAP.
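The masking and self-supervision steps described for MVR can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the zero-valued mask token, the feature shape, and in particular `placeholder_reconstruct` (which simply averages the visible views in place of a learned spatio-temporal decoder) are all assumptions made for illustration.

```python
import numpy as np


def mask_random_views(features, num_masked, rng, mask_token=0.0):
    """Replace the features of randomly chosen camera views with a mask token.

    features: array of shape (num_views, channels, height, width).
    Returns the masked copy and a boolean vector marking the masked views.
    The constant mask token is an assumption; a learned token is also possible.
    """
    num_views = features.shape[0]
    if not 0 < num_masked < num_views:
        raise ValueError("num_masked must leave at least one visible view")
    masked_idx = rng.choice(num_views, size=num_masked, replace=False)
    is_masked = np.zeros(num_views, dtype=bool)
    is_masked[masked_idx] = True
    masked = features.copy()
    masked[is_masked] = mask_token
    return masked, is_masked


def placeholder_reconstruct(masked, is_masked):
    """Hypothetical stand-in for a learned decoder.

    Fills each masked view with the mean of the visible views. A real module
    would predict the masked features from cross-view spatio-temporal context.
    """
    recon = masked.copy()
    recon[is_masked] = masked[~is_masked].mean(axis=0)
    return recon


def reconstruction_loss(recon, original, is_masked):
    """Mean squared error on the masked views only.

    The original (unmasked) features act as the self-supervision target.
    """
    diff = recon[is_masked] - original[is_masked]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
# Six surround-view cameras, as in nuScenes; feature sizes are arbitrary.
features = rng.normal(size=(6, 8, 4, 4))
masked, is_masked = mask_random_views(features, num_masked=1, rng=rng)
recon = placeholder_reconstruct(masked, is_masked)
loss = reconstruction_loss(recon, features, is_masked)
```

In end-to-end training, a loss of this kind would be added to the detection objective, so that the network learns to recover a failed camera's features rather than merely tolerate their absence.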