Learning robust and scalable visual representations from massive multi-view video data remains a challenge in computer vision and autonomous driving. Existing pre-training methods either rely on expensive supervised learning with 3D annotations, which limits scalability, or focus on single-frame or monocular inputs, which neglects temporal information. We propose MIM4D, a novel pre-training paradigm based on dual masked image modeling (MIM). MIM4D leverages both spatial and temporal relations by training on masked multi-view video inputs. It constructs pseudo-3D features using continuous scene flow and projects them onto a 2D plane for supervision. To address the lack of dense 3D supervision, MIM4D reconstructs pixels with 3D volumetric differentiable rendering and thereby learns geometric representations. We demonstrate that MIM4D achieves state-of-the-art performance on the nuScenes dataset for visual representation learning in autonomous driving. It significantly improves upon existing methods on multiple downstream tasks, including BEV segmentation (+8.7% IoU), 3D object detection (+3.5% mAP), and HD map construction (+1.4% mAP). Our work offers a new option for learning representations at scale in autonomous driving. Code and models are released at https://github.com/hustvl/MIM4D.
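To make the pixel-reconstruction idea concrete, the following is a minimal sketch of volumetric rendering along a single camera ray. It is not taken from the MIM4D code base: the abstract does not specify the rendering formulation, so the sketch assumes the standard emission-absorption quadrature used in neural volume rendering, and the function name `render_ray` and its arguments are hypothetical.

```python
import math


def render_ray(sigmas, colors, t_edges):
    """Composite samples along one ray (assumed emission-absorption model).

    sigmas:  N non-negative densities, one per ray interval
    colors:  N (r, g, b) tuples, one per interval
    t_edges: N + 1 increasing distances bounding the intervals
    Returns (rgb, depth, weights).
    """
    transmittance = 1.0  # fraction of light that reaches the current interval
    rgb = [0.0, 0.0, 0.0]
    depth = 0.0
    weights = []
    for i, (sigma, color) in enumerate(zip(sigmas, colors)):
        delta = t_edges[i + 1] - t_edges[i]
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this interval
        w = transmittance * alpha  # contribution of this interval to the pixel
        weights.append(w)
        for c in range(3):
            rgb[c] += w * color[c]
        # Expected depth uses the midpoint of each interval.
        depth += w * 0.5 * (t_edges[i] + t_edges[i + 1])
        transmittance *= 1.0 - alpha
    return rgb, depth, weights


# Hypothetical usage: three intervals, the second one nearly opaque.
rgb, depth, weights = render_ray(
    sigmas=[0.0, 50.0, 1.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    t_edges=[0.0, 1.0, 2.0, 3.0],
)
```

Every step is differentiable with respect to the densities and colors, so under this assumption a reconstruction loss between the rendered `rgb` and the observed pixel can supervise the underlying 3D features without 3D labels.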