We introduce Correlational Image Modeling (CIM), a novel and surprisingly effective approach to self-supervised visual pre-training. CIM performs a simple pretext task: we randomly crop image regions (exemplars) from an input image (context) and predict correlation maps between the exemplars and the context. Three key designs make correlational image modeling a nontrivial and meaningful self-supervisory task. First, to generate useful exemplar-context pairs, we consider cropping image regions with various scales, shapes, rotations, and transformations. Second, we employ a bootstrap learning framework that involves online and target encoders. During pre-training, the former takes the exemplars as input while the latter encodes the context. Third, we model the output correlation maps via a simple cross-attention block, within which the context serves as queries and the exemplars provide keys and values. We show that CIM performs on par with or better than the current state of the art on self-supervised and transfer benchmarks.
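The pretext task described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the crop is fixed and axis-aligned rather than random and transformed, the target is taken to be a binary mask marking where the exemplar was cropped, the online and target encoders are replaced by random token features, and the helper names (`crop_exemplar`, `cross_attention`, `predict_correlation_map`) and the projection `w_out` are hypothetical. Only the roles in the attention block follow the text: context tokens act as queries, exemplar tokens supply keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def crop_exemplar(context, top, left, height, width):
    """Crop an exemplar and return it with a binary mask of its location.

    Assumption: the mask stands in for the target correlation map.
    """
    exemplar = context[top:top + height, left:left + width]
    mask = np.zeros(context.shape[:2], dtype=np.float32)
    mask[top:top + height, left:left + width] = 1.0
    return exemplar, mask

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)      # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ values, weights            # (num_queries, d)

def predict_correlation_map(context_tokens, exemplar_tokens, w_out, grid_hw):
    """Context tokens are queries; exemplar tokens are keys and values."""
    attended, _ = cross_attention(context_tokens, exemplar_tokens, exemplar_tokens)
    logits = attended @ w_out                   # one logit per context token
    return logits.reshape(grid_hw)

rng = np.random.default_rng(0)

# A toy 64x64 RGB "context" image and one fixed crop as the exemplar.
context = rng.random((64, 64, 3)).astype(np.float32)
exemplar, target_mask = crop_exemplar(context, top=16, left=8, height=24, width=32)

# Stand-ins for encoder outputs (hypothetical): an 8x8 grid of context
# tokens and 12 exemplar tokens, each of dimension 16.
grid_hw, dim = (8, 8), 16
context_tokens = rng.standard_normal((grid_hw[0] * grid_hw[1], dim))
exemplar_tokens = rng.standard_normal((12, dim))
w_out = rng.standard_normal(dim)

pred_map = predict_correlation_map(context_tokens, exemplar_tokens, w_out, grid_hw)
```

In an actual training setup the predicted map would be compared against the target map with a suitable loss, and the token features would come from the encoders rather than a random generator; both details are outside what the abstract specifies.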