Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) multi-step sampling, and (ii) operation in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a step further toward this goal and propose "pixel MeanFlow" (pMF). Our core design principle is to decouple the network's output space from the loss space. The network target is designed to lie on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step, latent-free generation on ImageNet at 256x256 resolution (2.22 FID) and 512x512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.
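As an illustrative sketch (not specified in the abstract itself): under the standard flow-matching interpolation $z_t = (1-t)x + t\epsilon$ and the MeanFlow definition of average velocity $u(z_t, r, t) = (z_t - z_r)/(t - r)$, one plausible form of a transformation from an x-prediction $\hat{x}_\theta$ to an average velocity, assuming the endpoint $r = 0$ corresponds to the clean image, is:

```latex
% Hypothetical sketch only: assumes the interpolation z_t = (1-t)x + t*eps
% and the MeanFlow average velocity u(z_t, r, t) = (z_t - z_r)/(t - r).
u_\theta(z_t, 0, t) = \frac{z_t - \hat{x}_\theta(z_t, 0, t)}{t},
\qquad
\text{one-step sampling: } \hat{x} = z_1 - u_\theta(z_1, 0, 1) = \hat{x}_\theta(z_1, 0, 1).
```

Under this reading, the MeanFlow loss is applied to $u_\theta$ in velocity space while the network's raw output stays on the image manifold; the exact transformation used by pMF may differ from this sketch.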