Autonomous driving requires robust perception models trained on high-quality, large-scale multi-view driving videos for tasks such as 3D object detection, segmentation, and trajectory prediction. While world models provide a cost-effective way to generate realistic driving videos, it remains challenging to ensure these videos adhere to fundamental physical principles, such as relative and absolute motion, spatial relationships like occlusion, spatial consistency, and temporal consistency. To address this, we propose DrivePhysica, a model designed to generate realistic multi-view driving videos that accurately adhere to essential physical principles through three key advancements: (1) a Coordinate System Aligner module that integrates relative and absolute motion features to enhance motion interpretation, (2) an Instance Flow Guidance module that ensures precise temporal consistency via efficient 3D flow extraction, and (3) a Box Coordinate Guidance module that improves spatial relationship understanding and accurately resolves occlusion hierarchies. Grounded in these physical principles, we achieve state-of-the-art performance in driving video generation quality (3.96 FID and 38.06 FVD on the nuScenes dataset) and in downstream perception tasks. Our project homepage: https://metadrivescape.github.io/papers_project/DrivePhysica/page.html