Diffusion probabilistic models (DPMs) have shown clear advantages in high-resolution image reconstruction. Masked autoencoders (MAEs), as popular self-supervised vision learners, offer simpler and more effective image reconstruction and transfer well to downstream tasks. However, both incur extremely high training costs, either because of inherently high temporal dependence (i.e., an excessively large number of diffusion steps) or because of artificially low spatial dependence (i.e., a manually set high mask ratio, such as 0.75). To this end, this paper presents LMD, a faster image reconstruction framework based on latent masking diffusion. First, we project and reconstruct images in latent space through a pre-trained variational autoencoder, which is theoretically more efficient than operating in pixel space. Then, we combine the advantages of MAEs and DPMs to design a progressive masking diffusion model, which gradually increases the masking proportion using three different schedulers and reconstructs the latent features from simple to difficult. It neither performs sequential denoising diffusion as in DPMs nor uses a fixed high masking ratio as in MAEs, which alleviates the high training time cost. Our approach allows learning high-capacity models and accelerates their training (by 3x or more) with barely any reduction in the original accuracy. Inference speed in downstream tasks is also significantly faster than that of previous approaches.
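To make the idea of a progressively increasing masking proportion concrete, the following is a minimal sketch and not the method's actual implementation. The abstract does not name the three schedulers, so the `linear`, `cosine`, and `exp` shapes below are assumptions chosen for illustration, as are the function names, the starting ratio of 0.15, and the use of uniform random token selection. Only the upper ratio of 0.75 comes from the text, where it is cited as the typical fixed MAE mask ratio.

```python
import math
import random


def mask_ratio(step, total_steps, start=0.15, end=0.75, kind="linear"):
    """Hypothetical schedule: the mask ratio grows from `start` to `end`.

    `start`, `end`, and the three `kind` shapes are illustrative assumptions;
    the abstract only states that three schedulers are used.
    """
    # Normalised training progress in [0, 1].
    t = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    if kind == "linear":
        w = t
    elif kind == "cosine":
        # Slow start, fast middle, slow end.
        w = 0.5 * (1.0 - math.cos(math.pi * t))
    elif kind == "exp":
        # Stays easy for longer, then ramps up quickly.
        w = (math.exp(3.0 * t) - 1.0) / (math.exp(3.0) - 1.0)
    else:
        raise ValueError(f"unknown scheduler: {kind}")
    return start + (end - start) * w


def random_mask(num_tokens, ratio, rng):
    """Pick which latent tokens to mask, uniformly at random (an assumption)."""
    k = int(round(num_tokens * ratio))
    return sorted(rng.sample(range(num_tokens), k))


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 5, 9):
        r = mask_ratio(step, total_steps=10, kind="cosine")
        masked = random_mask(num_tokens=64, ratio=r, rng=rng)
        print(f"step {step}: ratio={r:.2f}, masked tokens={len(masked)}")
```

Under these assumptions, early steps hide only a small share of the latent tokens (an easy reconstruction task) and later steps approach the fixed high ratio used by MAEs, which matches the "simple to difficult" ordering described above.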