Modern diffusion/flow-based models for image generation typically exhibit two core characteristics: (i) they use multi-step sampling, and (ii) they operate in a latent space. Recent advances have made encouraging progress on each aspect individually, paving the way toward one-step diffusion/flow without latents. In this work, we take a further step toward this goal and propose "pixel MeanFlow" (pMF). Our core design principle is to decouple the network output space from the loss space. The network target is designed to lie on a presumed low-dimensional image manifold (i.e., x-prediction), while the loss is defined via MeanFlow in the velocity space. We introduce a simple transformation between the image manifold and the average velocity field. In experiments, pMF achieves strong results for one-step latent-free generation on ImageNet at 256x256 resolution (2.22 FID) and 512x512 resolution (2.48 FID), filling a key missing piece in this regime. We hope that our study will further advance the boundaries of diffusion/flow-based generative models.
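The abstract states that the network predicts x while the loss is defined in velocity space, connected by a simple transformation. The exact pMF formulation is not given here, so the following is only a minimal sketch of one plausible such transformation, assuming the standard linear-interpolation flow z_t = (1 - t)·x + t·eps and the MeanFlow displacement identity z_r = z_t - (t - r)·u(z_t, r, t); the function names are illustrative, not the authors' API.

```python
import numpy as np

def x_pred_to_avg_velocity(z_t, x_hat, t, r):
    """Map an x-prediction x_hat (a point on the presumed image manifold)
    to the implied average velocity u over [r, t], via the assumed identity
    z_r = z_t - (t - r) * u with x_hat standing in for the r-side endpoint."""
    return (z_t - x_hat) / (t - r)

def avg_velocity_to_x_pred(z_t, u, t, r):
    """Inverse map: recover the implied x-prediction from an average velocity."""
    return z_t - (t - r) * u

# Consistency check on random data: the two maps are exact inverses.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)        # clean sample (stand-in for an image)
eps = rng.standard_normal(4)      # noise
t, r = 1.0, 0.0                   # one-step setting: integrate from t=1 to r=0
z_t = (1 - t) * x + t * eps       # at t=1 this is pure noise

u = x_pred_to_avg_velocity(z_t, x, t, r)
x_rec = avg_velocity_to_x_pred(z_t, u, t, r)
print(np.allclose(x_rec, x))      # True: the transformation is invertible
```

Under these assumptions, a network outputting x_hat can still be trained with a velocity-space (MeanFlow) loss by passing its output through the first map, while one-step sampling applies the second map at t = 1, r = 0.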