Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset. Our project page is https://shanpoyang654.github.io/ConsisDrive/page.html.
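The abstract describes two mechanisms: attention restricted to same-instance tokens, and a loss that probabilistically up-weights foreground instances. Below is a minimal PyTorch sketch of both ideas, not the authors' implementation; tensor shapes, the background-id convention, and names such as `instance_ids`, `fg_mask`, `fg_prob`, and `fg_weight` are illustrative assumptions, and the sketch only covers identity masking over flattened spatio-temporal tokens (trajectory masks are omitted).

```python
import torch
import torch.nn.functional as F


def instance_masked_attention(q, k, v, instance_ids):
    """Attention in which each token may only attend to tokens that share
    its instance id (id 0 is treated as background and left unrestricted).

    q, k, v:       (B, T, D) query/key/value token features
    instance_ids:  (B, T) integer instance id per token, 0 = background
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("btd,bsd->bts", q, k) * scale   # (B, T, T)

    same_instance = instance_ids.unsqueeze(2) == instance_ids.unsqueeze(1)
    background = (instance_ids == 0).unsqueeze(2) | (instance_ids == 0).unsqueeze(1)
    allowed = same_instance | background                   # (B, T, T)

    scores = scores.masked_fill(~allowed, float("-inf"))
    attn = scores.softmax(dim=-1)
    return torch.einsum("bts,bsd->btd", attn, v)


def instance_masked_loss(pred, target, fg_mask, fg_prob=0.7, fg_weight=2.0):
    """Reconstruction loss that, with probability `fg_prob` (assumed
    hyperparameter), up-weights foreground instance pixels while keeping
    a baseline loss over the full scene.

    pred, target: (B, C, H, W) predicted / ground-truth frames (or latents)
    fg_mask:      (B, 1, H, W) binary foreground instance mask
    """
    per_pixel = F.mse_loss(pred, target, reduction="none")
    weights = torch.ones_like(per_pixel)
    if torch.rand(()) < fg_prob:                           # probabilistic masking
        weights = weights + (fg_weight - 1.0) * fg_mask
    return (per_pixel * weights).mean()
```

In this reading, the attention mask keeps an object's tokens bound to the same instance across frames (countering identity drift), while the loss still supervises the background so overall scene fidelity is preserved.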