We introduce Correlational Image Modeling (CIM), a novel and surprisingly effective approach to self-supervised visual pre-training. CIM performs a simple pretext task: we randomly crop image regions (exemplars) from an input image (context) and predict correlation maps between the exemplars and the context. Three key designs make correlational image modeling a nontrivial and meaningful self-supervisory task. First, to generate useful exemplar-context pairs, we crop image regions with various scales, shapes, rotations, and transformations. Second, we employ a bootstrap learning framework that involves online and target encoders. During pre-training, the former takes exemplars as inputs while the latter encodes the context. Third, we model the output correlation maps via a simple cross-attention block, within which the context serves as queries and the exemplars provide keys and values. We show that CIM performs on par with or better than the current state of the art on self-supervised and transfer benchmarks.
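The pretext task described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the target correlation map is a binary mask marking where the exemplar was cropped, that cropping is axis-aligned and rectangular (no rotations or other transformations), and that the loss is binary cross-entropy. The function names, the projection matrices `Wq`, `Wk`, `Wv`, `w_out`, and the random arrays standing in for encoder outputs are all hypothetical.

```python
import numpy as np


def crop_exemplar(context, rng, min_frac=0.25, max_frac=0.5):
    """Crop a random rectangle from `context` (H, W, C).

    Returns the exemplar and a binary (H, W) map that is 1 inside the crop.
    Assumption: the binary map is used as the target correlation map.
    """
    H, W = context.shape[:2]
    h = int(rng.integers(int(min_frac * H), int(max_frac * H) + 1))
    w = int(rng.integers(int(min_frac * W), int(max_frac * W) + 1))
    top = int(rng.integers(0, H - h + 1))
    left = int(rng.integers(0, W - w + 1))
    exemplar = context[top:top + h, left:left + w]
    target = np.zeros((H, W), dtype=np.float32)
    target[top:top + h, left:left + w] = 1.0
    return exemplar, target


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(context_tokens, exemplar_tokens, Wq, Wk, Wv):
    """Context tokens act as queries; exemplar tokens supply keys and values."""
    q = context_tokens @ Wq    # (Nc, d)
    k = exemplar_tokens @ Wk   # (Ne, d)
    v = exemplar_tokens @ Wv   # (Ne, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (Nc, Ne)
    return attn @ v            # (Nc, d)


def predict_correlation_map(context_tokens, exemplar_tokens, params, grid_hw):
    """Return one score in [0, 1] per context token, reshaped to the token grid."""
    fused = cross_attention(context_tokens, exemplar_tokens,
                            params["Wq"], params["Wk"], params["Wv"])
    logits = fused @ params["w_out"]           # (Nc,)
    return (1.0 / (1.0 + np.exp(-logits))).reshape(grid_hw)


def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy between predicted and target maps (assumed loss)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))


# Usage with random stand-ins for the encoder outputs.
rng = np.random.default_rng(0)
d = 8
params = {
    "Wq": rng.normal(size=(d, d)) * 0.1,
    "Wk": rng.normal(size=(d, d)) * 0.1,
    "Wv": rng.normal(size=(d, d)) * 0.1,
    "w_out": rng.normal(size=(d,)) * 0.1,
}
context_tokens = rng.normal(size=(16, d))   # 4 x 4 grid of context tokens
exemplar_tokens = rng.normal(size=(4, d))   # tokens of one cropped exemplar
pred_map = predict_correlation_map(context_tokens, exemplar_tokens,
                                   params, grid_hw=(4, 4))
```

In a real pipeline the two token sets would come from the two encoders of the bootstrap framework rather than from random arrays, and the target map would be resized to the resolution of the token grid before the loss is computed.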