In this work, we survey recent studies on masked image modeling (MIM), an approach that has emerged as a powerful self-supervised learning technique in computer vision. The MIM task involves masking some information, e.g., pixels, patches, or even latent representations, and training a model, usually an autoencoder, to predict the missing information from the context available in the visible part of the input. We identify and formalize two categories of approaches to implementing MIM as a pretext task, one based on reconstruction and one based on contrastive learning. Then, we construct a taxonomy and review the most prominent papers of recent years. We complement the manually constructed taxonomy with a dendrogram obtained by applying a hierarchical clustering algorithm, and we identify relevant clusters by manually inspecting the resulting dendrogram. Our review also covers the datasets commonly used in MIM research. We aggregate the performance results of various MIM methods on the most popular datasets to facilitate the comparison of competing methods. Finally, we identify research gaps and propose several directions for future work.
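To make the reconstruction-based variant of the MIM task concrete, the following is a minimal NumPy sketch, not an implementation taken from any surveyed method. It splits an image into patches, hides a random subset, and scores a prediction only on the hidden patches. The helper names (`patchify`, `random_mask`, `masked_reconstruction_loss`), the 8-pixel patch size, the 0.75 mask ratio, and the mean-patch stand-in for a trained autoencoder are all illustrative assumptions.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into N flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector where True marks a hidden (masked) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask


def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error averaged over the masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return per_patch[mask].mean()


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # synthetic image, assumption
patches = patchify(image, patch_size=8)  # 16 patches of 8 * 8 * 3 values
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)
visible = patches[~mask]                 # the context a model would see

# Hypothetical stand-in for a trained model: predict the mean visible
# patch at every position. A real MIM model would be a learned network.
pred = np.tile(visible.mean(axis=0), (len(patches), 1))
loss = masked_reconstruction_loss(pred, patches, mask)
```

Restricting the loss to masked positions is one common design choice in this sketch; a contrastive variant would instead compare representations of masked and visible views rather than raw patch values.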