Driven by the emergence of Controllable Video Diffusion, existing Sim2Real methods for autonomous driving video generation typically rely on explicit intermediate representations to bridge the domain gap. However, these modalities face a fundamental Consistency-Realism Dilemma. Low-level signals (e.g., edges, blurred images) ensure precise control but compromise realism by "baking in" synthetic artifacts, whereas high-level priors (e.g., depth, semantics, HD maps) facilitate photorealism but lack the structural detail required for consistent guidance. In this work, we present Driving with DINO (DwD), a novel framework that leverages Vision Foundation Model (VFM) features as a unified bridge between the simulation and real-world domains. We first identify that these features encode a spectrum of information, from high-level semantics to fine-grained structure. To exploit this effectively, we employ Principal Subspace Projection to discard the high-frequency components responsible for "texture baking," while introducing Random Channel Tail Drop to mitigate the structural loss inherent in rigid dimensionality reduction, thereby reconciling realism with control consistency. Furthermore, to fully leverage DINOv3's high-resolution capabilities for enhancing control precision, we introduce a learnable Spatial Alignment Module that adapts these high-resolution features to the diffusion backbone. Finally, we propose a Causal Temporal Aggregator that employs causal convolutions to explicitly preserve historical motion context when integrating frame-wise DINO features, which effectively mitigates motion blur and ensures temporal stability. Project page: https://albertchen98.github.io/DwD-project/
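To make the feature-processing idea concrete, the following is a minimal sketch of principal subspace projection combined with a random channel tail drop, written in plain NumPy. It is an illustration of the general technique described above, not the paper's implementation: the function names, the PCA-via-SVD formulation, and the `min_keep` parameter are all assumptions for exposition.

```python
import numpy as np

def principal_subspace_project(feats, k):
    """Project frame-wise VFM features (e.g., DINO patch tokens) onto
    their top-k principal subspace, discarding high-frequency tail
    components that would otherwise "bake in" synthetic texture.

    feats: (N, C) array of N feature vectors with C channels.
    Returns (coeffs, basis, mean): (N, k) coordinates, (k, C) basis,
    and the (1, C) feature mean. (Illustrative sketch, not the paper's code.)
    """
    mean = feats.mean(axis=0, keepdims=True)
    centered = feats - mean
    # PCA via SVD: rows of vt are principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                 # (k, C) principal subspace
    coeffs = centered @ basis.T    # (N, k) subspace coordinates
    return coeffs, basis, mean

def random_channel_tail_drop(coeffs, min_keep, rng):
    """Randomly zero out a trailing block of principal channels during
    training, so the downstream model sees a range of effective subspace
    ranks instead of one rigid cutoff (mitigating structural loss from
    hard dimensionality reduction). `min_keep` is a hypothetical knob."""
    k = coeffs.shape[1]
    keep = rng.integers(min_keep, k + 1)  # keep in [min_keep, k]
    out = coeffs.copy()
    out[:, keep:] = 0.0
    return out

# Usage sketch: project 100 synthetic 64-dim features to a rank-16
# subspace, then apply the stochastic tail drop.
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 64))
coeffs, basis, mean = principal_subspace_project(feats, k=16)
dropped = random_channel_tail_drop(coeffs, min_keep=8, rng=rng)
```

The key design point this sketch illustrates is that the cutoff rank becomes a training-time random variable rather than a fixed hyperparameter, which is what lets the control signal retain structure without reintroducing the baked-in texture of the full feature.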