Generating multi-camera street-view videos is critical for augmenting autonomous driving datasets, addressing the urgent demand for extensive and varied data. Because traditional rendering-based methods offer limited diversity and struggle with lighting conditions, they are increasingly being supplanted by diffusion-based methods. However, a significant challenge for diffusion-based methods is ensuring that the generated sensor data preserve both intra-world consistency and inter-sensor coherence. To address this challenge, we introduce an explicit world volume and propose the World Volume-aware Multi-camera Driving Scene Generator (WoVoGen). The system is designed to use a 4D world volume as the foundation for video generation. Our model operates in two distinct phases: (i) envisioning the future 4D temporal world volume based on vehicle control sequences, and (ii) generating multi-camera videos informed by this envisioned 4D temporal world volume and sensor interconnectivity. Incorporating the 4D world volume enables WoVoGen not only to generate high-quality street-view videos in response to vehicle control inputs but also to support scene editing tasks.
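The two-phase data flow described in the abstract can be illustrated with a toy sketch. This is not the WoVoGen implementation: the function names, tensor layouts, the camera count, and the representation of controls as integer voxel shifts are all assumptions made for illustration, and the learned generative models are replaced by trivial placeholders so that only the interfaces between the phases are shown.

```python
import numpy as np


def envision_world_volume(past_volumes, controls):
    """Phase (i) placeholder: predict future world volumes from controls.

    past_volumes: array of shape (T_past, X, Y, Z, C), a hypothetical layout.
    controls: list of (dx, dy) integer voxel shifts, a stand-in for real
              vehicle control sequences.
    A learned model would go here; this stub only shifts the last known
    volume to mimic ego motion.
    """
    last = past_volumes[-1]
    future = []
    for dx, dy in controls:
        # Moving the ego vehicle forward shifts the scene backward in the grid.
        last = np.roll(last, shift=(-dx, -dy), axis=(0, 1))
        future.append(last)
    return np.stack(future)  # shape (T_future, X, Y, Z, C)


def generate_multicamera_video(world_volumes, num_cameras, height, width):
    """Phase (ii) placeholder: produce one frame per camera per timestep.

    Every camera at a given timestep is conditioned on the same world volume,
    which is what ties the views together. The "rendering" here is just a
    constant image whose intensity is the volume mean.
    """
    num_steps = world_volumes.shape[0]
    video = np.zeros((num_steps, num_cameras, height, width, 3), dtype=np.float32)
    for t in range(num_steps):
        video[t] = world_volumes[t].mean()  # shared conditioning signal
    return video


# Illustrative usage with made-up sizes.
past = np.zeros((2, 8, 8, 4, 1), dtype=np.float32)
past[-1, 0, 0, 0, 0] = 1.0                      # a single occupied voxel
controls = [(1, 0), (0, 1), (1, 1)]             # three future control steps
future = envision_world_volume(past, controls)
video = generate_multicamera_video(future, num_cameras=6, height=16, width=32)
```

The point of the sketch is the dependency structure: the video generator never sees the controls directly, only the envisioned volume, so editing that volume before phase (ii) would change all camera views together.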