Many pixelwise dense prediction tasks in computer vision, such as depth estimation and semantic segmentation, rely today on pretrained image representations. Curating effective pretraining datasets is therefore vital. Unfortunately, effective pretraining datasets are those with multi-view scenes, and these have only been curated using annotated 3D meshes, point clouds, and camera parameters from simulated environments. We propose a dataset-curation mechanism that does not require any annotations. We mine two datasets, MIMIC-1M with 1.3M and MIMIC-3M with 3.1M multi-view image pairs, from open-sourced video datasets and from synthetic 3D environments. We train multiple self-supervised models with different masked image modeling objectives and report the following findings. Representations trained on MIMIC-3M outperform those trained on data mined using annotations on multiple downstream tasks, including depth estimation, semantic segmentation, surface normals, and pose estimation. They also outperform them when the representations are frozen and when downstream training data is limited to few-shot settings. The larger dataset (MIMIC-3M) significantly improves performance, which is promising since our curation method can scale arbitrarily to produce even larger datasets. MIMIC code, dataset, and pretrained models are open-sourced at https://github.com/RAIVNLab/MIMIC.