Using synthesized images to boost the performance of perception models is a long-standing research challenge in computer vision. It becomes more pressing in vision-centric autonomous driving systems with multi-view cameras, as some long-tail scenarios can never be collected. Guided by BEV segmentation layouts, existing generative networks appear to synthesize photo-realistic street-view images when evaluated solely on scene-level metrics. However, once zoomed in, they usually fail to produce accurate foreground and background details such as heading. To this end, we propose a two-stage generative method, dubbed BEVControl, that can generate accurate foreground and background content. In contrast to segmentation-like input, it also supports sketch-style input, which is more flexible for humans to edit. In addition, we propose a comprehensive multi-level evaluation protocol to fairly compare the quality of the generated scene, foreground objects, and background geometry. Our extensive experiments show that BEVControl surpasses the state-of-the-art method, BEVGen, by a significant margin, improving foreground segmentation mIoU from 5.89 to 26.80. In addition, we show that training the downstream perception model with images generated by BEVControl yields an average improvement of 1.29 in NDS.