Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Specifically, we compare the features produced by an off-the-shelf feature extraction model for the ground-truth and generated video clips in regions localized around dynamic objects, yielding a semantic feature consistency loss. We fine-tune the base model by combining this loss with the standard diffusion loss. Fine-tuned for a single epoch with our novel loss, the model outperforms the baselines on common video generation evaluation metrics. To further assess the temporal consistency of generated videos, we adapt two additional metrics from the object detection task, namely mAP and mIoU. Extensive experiments on the nuScenes and KITTI datasets demonstrate that our approach enhances temporal consistency in video generation without requiring external control signals at inference time or incurring any additional computational overhead.
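The core idea, comparing features of a frozen off-the-shelf extractor between ground-truth and generated clips only inside dynamic-object regions and adding that term to the diffusion loss, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the extractor interface, the box format, and the `lambda_lsa` weight are hypothetical.

```python
import torch
import torch.nn.functional as F

def lsa_loss(gt_clip, gen_clip, boxes, feature_extractor):
    """Semantic feature consistency loss localized around dynamic objects.

    gt_clip, gen_clip: (T, C, H, W) ground-truth and generated frames.
    boxes: per-frame lists of (x1, y1, x2, y2) image-space boxes around
        dynamic objects (assumed given, e.g. from dataset annotations).
    feature_extractor: frozen off-the-shelf model assumed to map a batch of
        frames (T, C, H, W) to spatial features (T, D, h, w).
    """
    with torch.no_grad():
        gt_feats = feature_extractor(gt_clip)   # targets: no gradient needed
    gen_feats = feature_extractor(gen_clip)     # gradients flow to the generator

    # scale factors from image coordinates to feature-map coordinates
    sx = gt_feats.shape[-1] / gt_clip.shape[-1]
    sy = gt_feats.shape[-2] / gt_clip.shape[-2]

    losses = []
    for t, frame_boxes in enumerate(boxes):
        for (x1, y1, x2, y2) in frame_boxes:
            fx1 = int(x1 * sx)
            fx2 = max(fx1 + 1, int(x2 * sx))    # keep at least one feature cell
            fy1 = int(y1 * sy)
            fy2 = max(fy1 + 1, int(y2 * sy))
            # align semantics only inside the localized dynamic-object region
            losses.append(F.mse_loss(gen_feats[t, :, fy1:fy2, fx1:fx2],
                                     gt_feats[t, :, fy1:fy2, fx1:fx2]))
    if not losses:
        return gen_feats.sum() * 0.0            # no dynamic objects in this clip
    return torch.stack(losses).mean()

# Fine-tuning objective: standard diffusion loss plus the weighted LSA term.
# lambda_lsa is a hypothetical trade-off weight.
# total_loss = diffusion_loss + lambda_lsa * lsa_loss(gt, gen, boxes, extractor)
```

Because the extra term is applied only during fine-tuning, inference with the resulting model is unchanged, which is consistent with the abstract's claim of no control signals or computational overhead at generation time.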