At the core of both successful generative and self-supervised representation learning models lies a reconstruction objective that incorporates some form of image corruption. Diffusion models implement this approach through a scheduled Gaussian corruption process, while masked auto-encoder models do so by masking patches of the image. Despite their different approaches, the underlying similarity in their methodologies suggests a promising avenue for an auto-encoder capable of both denoising tasks. We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD), that combines patch-based and noise-based corruption within a single auto-encoding framework. Specifically, UMD modifies the diffusion transformer (DiT) training process by introducing an additional noise-free, high-masking representation step into the diffusion noising schedule, and uses a mixed masked-and-noised image at subsequent timesteps. By integrating features useful both for diffusion modeling and for predicting masked patch tokens, UMD achieves strong performance in downstream generative and representation learning tasks, including linear probing and class-conditional generation. This is achieved without heavy data augmentations, multiple views, or additional encoders. Furthermore, UMD improves on the total training time of prior diffusion-based methods. We release our code at https://github.com/philippe-eecs/small-vision.
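The corruption process described above can be sketched in a few lines. The following is an illustrative NumPy sketch, not the paper's implementation: the function name `umd_corrupt`, the masking ratios, and the cosine noise schedule are all assumptions chosen for concreteness.

```python
import numpy as np

def umd_corrupt(patches, t, num_steps=1000, mask_ratio_repr=0.75,
                mask_ratio_noised=0.25, rng=None):
    """Sketch of UMD-style corruption for one image's patch tokens.

    patches: (N, D) array of patch embeddings.
    t == 0 models the extra noise-free, high-masking representation step;
    t > 0 applies scheduled Gaussian noise plus lighter masking.
    Ratios and the schedule are illustrative, not the paper's values.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n = patches.shape[0]
    if t == 0:
        # Representation step: no Gaussian noise, aggressive MAE-style masking.
        ratio = mask_ratio_repr
        noised = patches
    else:
        # Diffusion step: cosine-schedule Gaussian corruption plus masking.
        alpha_bar = np.cos(0.5 * np.pi * t / num_steps) ** 2
        noise = rng.standard_normal(patches.shape)
        noised = np.sqrt(alpha_bar) * patches + np.sqrt(1.0 - alpha_bar) * noise
        ratio = mask_ratio_noised
    # Randomly drop `ratio` of patch tokens; the encoder sees only the rest.
    keep = rng.permutation(n)[: int(n * (1 - ratio))]
    return noised[keep], np.sort(keep)
```

Under this sketch, a single training batch can mix t = 0 (pure masked-reconstruction) samples with t > 0 (masked-and-noised) samples, so one network learns both objectives.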