Motivated by the observation that the forward and backward passes of a deep network naturally form symmetric mappings between input and output representations, we introduce a simple yet effective self-supervised pretraining framework for vision models, inspired by energy-based models (EBMs). In the proposed framework, we model energy estimation and data restoration as the forward and backward passes of a single network, without auxiliary components such as an extra decoder. In the forward pass, we fit the network as an energy function that assigns low energy to samples from an unlabeled dataset and high energy to all other inputs. In the backward pass, we restore data from corrupted versions iteratively, using gradient-based optimization along the direction of energy minimization. In this way, we fold the encoder-decoder architecture widely used in masked image modeling into the forward and backward passes of a single vision model. As a result, our framework accepts a wide range of pretext tasks with different data corruption methods, and permits models to be pretrained with masked image modeling, patch sorting, and image restoration, including super-resolution, denoising, and colorization. We support our findings with extensive experiments and show that the proposed method delivers comparable or even better performance with remarkably fewer training epochs than state-of-the-art self-supervised vision model pretraining methods. Our findings shed light on further exploring self-supervised vision model pretraining and pretext tasks beyond masked image modeling.
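The restoration step described above (repeatedly moving a corrupted sample along the negative gradient of the energy) can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the framework itself: the energy here is a hand-written quadratic around a known clean vector, standing in for the learned network energy, and the names `energy`, `energy_grad`, `restore`, `step_size`, and `num_steps` are hypothetical.

```python
# Toy illustration only: the energy below is a hand-written quadratic,
# an assumed stand-in for the learned network energy in the abstract.

def energy(x, prototype):
    """Hypothetical energy: low near `prototype`, high elsewhere."""
    return 0.5 * sum((xi - pi) ** 2 for xi, pi in zip(x, prototype))


def energy_grad(x, prototype):
    """Analytic gradient of `energy` with respect to x."""
    return [xi - pi for xi, pi in zip(x, prototype)]


def restore(x_corrupted, prototype, step_size=0.1, num_steps=50):
    """Iteratively update x along the negative energy gradient."""
    x = list(x_corrupted)
    for _ in range(num_steps):
        g = energy_grad(x, prototype)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x


clean = [1.0, -2.0, 0.5, 3.0]
corrupted = [1.0, 0.0, 0.5, 0.0]  # two entries "masked" to zero
restored = restore(corrupted, clean)

# Energy should drop as the corrupted sample is pulled back toward the data.
print(energy(corrupted, clean), energy(restored, clean))
```

In the actual framework the gradient would come from backpropagation through the vision model rather than from a closed-form expression; the sketch only shows the shape of the iterative update.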