Generating multi-view videos for autonomous driving training has recently gained much attention; the key challenge is maintaining both cross-view and cross-frame consistency. Existing methods typically apply decoupled attention mechanisms over the spatial, temporal, and view dimensions. However, these approaches often struggle to maintain consistency across dimensions, particularly for fast-moving objects that appear at different times and viewpoints. In this paper, we present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. We also propose a lightweight controller tailored for CogDriving, named Micro-Controller, which uses only 1.1% of the parameters of the standard ControlNet while enabling precise control over Bird's-Eye-View layouts. To enhance the generation of object instances crucial for autonomous driving, we propose a re-weighted learning objective that dynamically adjusts the learning weights for object instances during training. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos. The project can be found at https://luhannan.github.io/CogDrivingPage/.
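To make the contrast with decoupled attention concrete, the sketch below illustrates the core idea behind holistic-4D attention as described above: rather than running separate attention passes over space, time, and view, all four axes (view, time, height, width) are flattened into a single token sequence so every token can attend to every other token across dimensions jointly. This is a minimal illustrative sketch, not the paper's implementation; the module name, head count, and tensor layout are assumptions.

```python
import torch
import torch.nn as nn

class Holistic4DAttention(nn.Module):
    """Illustrative sketch: joint attention over view, time, and space.

    Instead of decoupled spatial / temporal / view attention blocks, the
    view (V), time (T), and spatial (H, W) axes are flattened into one
    token sequence, so a fast-moving object seen at different times and
    viewpoints is associated within a single attention operation.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, V, T, H, W, C) — batch, camera views, frames, latent grid, channels
        B, V, T, H, W, C = x.shape
        tokens = x.reshape(B, V * T * H * W, C)  # one joint 4D token sequence
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(B, V, T, H, W, C)

# Toy example: 6 camera views, 4 frames, an 8x8 latent grid, 32 channels.
x = torch.randn(2, 6, 4, 8, 8, 32)
y = Holistic4DAttention(dim=32)(x)
print(tuple(y.shape))  # (2, 6, 4, 8, 8, 32)
```

The trade-off is quadratic cost in the flattened sequence length (V·T·H·W), which is presumably why such joint attention is paired with a compact latent space in practice.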