Many pixelwise dense prediction tasks in computer vision, such as depth estimation and semantic segmentation, rely on pretrained image representations. Curating effective pretraining datasets is therefore vital. Unfortunately, effective pretraining datasets are those with multi-view scenes, and these have only been curated using annotated 3D meshes, point clouds, and camera parameters from simulated environments. We propose a dataset-curation mechanism that does not require any annotations. We mine two datasets, MIMIC-1M with 1.3M and MIMIC-3M with 3.1M multi-view image pairs, from open-sourced video datasets and from synthetic 3D environments. We train multiple self-supervised models with different masked image modeling objectives to showcase the following findings. Representations trained on MIMIC-3M outperform those trained on datasets mined using annotations on multiple downstream tasks, including depth estimation, semantic segmentation, surface normals, and pose estimation. They also outperform when the representations are frozen and when downstream training data is limited to few-shot. The larger dataset (MIMIC-3M) significantly improves performance, which is promising since our curation method can scale arbitrarily to produce even larger datasets. MIMIC code, dataset, and pretrained models are open-sourced at https://github.com/RAIVNLab/MIMIC.