Dense pixel-level representation learning at scale has been bottlenecked by the unavailability of large-scale multi-view datasets. Current methods for building effective pretraining datasets rely heavily on annotated 3D meshes, point clouds, and camera parameters from simulated environments, preventing them from leveraging real-world data sources where such metadata is lacking. We propose a pretraining dataset-curation approach that requires no additional annotations. Our method generates multi-view datasets at scale from both real-world videos and simulated environments. Specifically, we experiment with two scales: MIMIC-1M with 1.3M and MIMIC-3M with 3.1M multi-view image pairs. We train multiple models with different masked image modeling objectives and report the following findings. Representations trained on our automatically generated MIMIC-3M outperform those learned from an expensive crowdsourced dataset (ImageNet-1K) and from a synthetic environment (MULTIVIEW-HABITAT) on two dense geometric tasks: depth estimation on NYUv2 (+1.7%) and surface normal estimation on Taskonomy (+2.05%). On dense tasks that also require object understanding, we outperform MULTIVIEW-HABITAT on semantic segmentation on ADE20K (+3.89%) and pose estimation on MSCOCO (+9.4%), and we narrow the gap with models pretrained on the object-centric, expensive ImageNet-1K. These gains hold even when the representations are frozen and when downstream training data is limited to few-shot settings. The larger dataset (MIMIC-3M) significantly improves performance, which is promising because our curation method can scale arbitrarily to produce even larger datasets. MIMIC code, dataset, and pretrained models are open-sourced at https://github.com/RAIVNLab/MIMIC.