There has been a longstanding belief that generation can facilitate a true understanding of visual data. In line with this, we revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models. While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE). Our approach is capable of (i) serving as a strong initialization for downstream recognition tasks, (ii) conducting high-quality image inpainting, and (iii) being effortlessly extended to video where it produces state-of-the-art classification accuracy. We further perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.