MIMIC: Masked Image Modeling with Image Correspondences

Many pixelwise dense prediction tasks in computer vision, such as depth estimation and semantic segmentation, rely on pretrained image representations. Curating effective pretraining datasets is therefore vital. Unfortunately, the effective pretraining datasets are those with multi-view scenes, and these have so far been curated only by using annotated 3D meshes, point clouds, and camera parameters from simulated environments. We propose a dataset-curation mechanism that does not require any annotations. We mine two datasets, MIMIC-1M with 1.3M and MIMIC-3M with 3.1M multi-view image pairs, from open-sourced video datasets and from synthetic 3D environments. We train multiple self-supervised models with different masked image modeling objectives to showcase the following findings. Representations trained on MIMIC-3M outperform those trained on datasets mined using annotations on multiple downstream tasks, including depth estimation, semantic segmentation, surface normals, and pose estimation. They also perform better when the representations are frozen and when downstream training data is limited to few-shot settings. The larger dataset (MIMIC-3M) significantly improves performance, which is promising since our curation method can scale arbitrarily to produce even larger datasets. MIMIC code, dataset, and pretrained models are open-sourced at https://github.com/RAIVNLab/MIMIC.
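As a rough illustration of what a masked image modeling objective operates on, the sketch below hides a random subset of image patches; a model would then be trained to reconstruct the hidden patches. This is a minimal, generic sketch and not the MIMIC implementation: the function names `random_patch_mask` and `apply_mask`, the 0.75 mask ratio, and the list-of-lists image layout are all assumptions made for illustration.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio=0.75, seed=None):
    """Return a grid_h x grid_w grid of booleans; True marks a hidden patch.

    Hypothetical helper: the mask ratio is an illustrative default,
    not a value taken from the MIMIC paper.
    """
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = round(mask_ratio * num_patches)
    # Pick the hidden patches uniformly at random, without replacement.
    hidden = set(rng.sample(range(num_patches), num_masked))
    return [
        [(r * grid_w + c) in hidden for c in range(grid_w)]
        for r in range(grid_h)
    ]


def apply_mask(image, mask, patch_size):
    """Zero out every pixel of a 2D image that falls inside a hidden patch."""
    return [
        [0 if mask[y // patch_size][x // patch_size] else px
         for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]


if __name__ == "__main__":
    # An 8x8 single-channel image split into 2x2 patches gives a 4x4 patch grid.
    image = [[1] * 8 for _ in range(8)]
    mask = random_patch_mask(4, 4, mask_ratio=0.75, seed=0)
    masked = apply_mask(image, mask, patch_size=2)
    for row in masked:
        print(row)
```

With multi-view image pairs, one plausible setup is to mask one image of the pair and let the model use the other view as context, but the abstract does not spell out the exact objectives, so treat that reading as an assumption.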
