Using synthesized images to boost the performance of perception models is a long-standing research challenge in computer vision. It is especially pressing in vision-centric autonomous driving systems with multi-view cameras, because data for some long-tail scenarios can never be collected. Guided by BEV segmentation layouts, existing generative networks appear to synthesize photo-realistic street-view images when evaluated solely on scene-level metrics. On closer inspection, however, they usually fail to produce accurate foreground and background details, such as object heading. To this end, we propose a two-stage generative method, dubbed BEVControl, that can generate accurate foreground and background content. In contrast to segmentation-like input, it also supports sketch-style input, which is more flexible for humans to edit. In addition, we propose a comprehensive multi-level evaluation protocol to fairly compare the quality of the generated scene, foreground objects, and background geometry. Our extensive experiments show that BEVControl surpasses the state-of-the-art method, BEVGen, by a significant margin, raising foreground segmentation mIoU from 5.89 to 26.80. We further show that training the downstream perception model on images generated by BEVControl yields an average improvement of 1.29 in NDS.
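For readers unfamiliar with the foreground metric quoted above, the following is a minimal sketch of how a mean intersection-over-union (mIoU) score can be computed from predicted and ground-truth class labels. It is not the evaluation code behind the reported numbers: the function name `mean_iou`, the flat label lists, and the percentage scaling are all assumptions made for illustration.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU (as a percentage) over classes present in pred or target.

    Hypothetical helper: `pred` and `target` are flattened per-pixel class
    labels of equal length. Classes absent from both are skipped.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both prediction and ground truth.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels labelled c in either prediction or ground truth.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)

    # Assumed convention: report on a 0-100 scale.
    return 100.0 * sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes (0 = background, 1 = foreground).
pred = [0, 1, 1, 0]
target = [0, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
```

In this toy case the per-class IoUs are 2/3 and 1/2, so the mean is roughly 58.33. A score computed this way rewards generated images whose foreground objects land where the layout says they should, which scene-level metrics alone do not capture.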