Recent video diffusion models generate photorealistic, temporally coherent videos, yet they fall short as reliable world models for autonomous driving, where structured motion and physically consistent interactions are essential. Adapting these generalist video models to driving domains has shown promise but typically requires massive domain-specific data and costly fine-tuning. We propose an efficient adaptation framework that converts generalist video diffusion models into controllable driving world models with minimal supervision. The key idea is to decouple motion learning from appearance synthesis. First, the model is adapted to predict structured motion in a simplified form: videos of skeletonized agents and scene elements, focusing learning on physical and social plausibility. Then, the same backbone is reused to synthesize realistic RGB videos conditioned on these motion sequences, effectively "dressing" the motion with texture and lighting. This two-stage process mirrors a reasoning-rendering paradigm: first infer dynamics, then render appearance. Our experiments show that this decoupled approach is exceptionally efficient: adapting SVD, we match prior SOTA models with less than 6% of their compute. Scaling to LTX, our MAD-LTX model outperforms all open-source competitors and supports a comprehensive suite of text, ego, and object controls. Project page: https://vita-epfl.github.io/MAD-World-Model/