Fair Benchmarking of Emerging One-Step Generative Models Against Multistep Diffusion and Flow Models

Advaith Ravishankar, Serena Liu, Mingyang Wang, Todd Zhou, Jeffrey Zhou, Arnav Sharma, Ziling Hu, Léopold Das, Abdulaziz Sobirov, Faizaan Siddique, Freddy Yu, Seungjoo Baek, Yan Luo, Mengyu Wang

State-of-the-art text-to-image models produce high-quality images, but inference remains expensive because generation requires several sequential ODE or denoising steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step, yet fair comparisons with multi-step systems are difficult: studies use mismatched numbers of sampling steps and different classifier-free guidance (CFG) settings, and CFG can shift FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and there is little standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet. To address these gaps, we benchmark eight models spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1) under a controlled class-conditional protocol on ImageNet validation, ImageNetV2, and reLAIONet, a new proofread out-of-distribution dataset that we introduce and align to ImageNet label IDs. Using FID, Inception Score, CLIP Score, and Pick Score, we show that FID-focused model development and CFG selection can be misleading in few-step regimes, where a change in guidance can improve FID while degrading text-image alignment, human preference signals, and perceived quality. We further show that leading one-step models benefit from step scaling and become substantially more competitive under multi-step inference, although they still exhibit characteristic local distortions. To capture these tradeoffs, we introduce MinMax Harmonic Mean (MMHM), a composite proxy over all four metrics that stabilizes hyperparameter selection across guidance and step sweeps.
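The abstract does not give the formula for MMHM, so the following is only a minimal sketch of one plausible reading of the name: min-max normalize each metric across the configurations in a sweep (inverting FID, since lower is better), then take the harmonic mean of the four normalized values per configuration. The function names `minmax_normalize` and `mmhm`, the `eps` smoothing term, and every number in the toy sweep are assumptions made for illustration. They are not the authors' definition and not results from the paper.

```python
import numpy as np


def minmax_normalize(values, higher_is_better=True):
    """Rescale one metric to [0, 1] across a sweep (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Metric is constant across the sweep, so it cannot rank configurations.
        return np.ones_like(values)
    scaled = (values - lo) / (hi - lo)
    # Flip lower-is-better metrics (e.g. FID) so that 1.0 is always the best value.
    return scaled if higher_is_better else 1.0 - scaled


def mmhm(metrics, higher_is_better, eps=1e-8):
    """Assumed MMHM: harmonic mean of min-max normalized metrics per configuration.

    metrics: dict mapping metric name -> sequence of values, one per configuration.
    higher_is_better: dict mapping metric name -> bool.
    eps: small constant (an assumption) that avoids division by zero.
    """
    normalized = [
        minmax_normalize(metrics[name], higher_is_better[name]) for name in metrics
    ]
    stacked = np.stack(normalized, axis=0)  # shape: (num_metrics, num_configs)
    return stacked.shape[0] / np.sum(1.0 / (stacked + eps), axis=0)


# Toy guidance sweep with made-up placeholder numbers (NOT from the paper).
cfg_scales = [1.0, 2.0, 3.0]
toy_metrics = {
    "fid": [12.0, 8.0, 10.0],
    "inception_score": [100.0, 150.0, 200.0],
    "clip_score": [0.28, 0.30, 0.31],
    "pick_score": [20.0, 21.0, 21.5],
}
toy_direction = {
    "fid": False,
    "inception_score": True,
    "clip_score": True,
    "pick_score": True,
}

scores = mmhm(toy_metrics, toy_direction)
best_by_fid = cfg_scales[int(np.argmin(toy_metrics["fid"]))]
best_by_mmhm = cfg_scales[int(np.argmax(scores))]
print("CFG chosen by FID alone:", best_by_fid)
print("CFG chosen by assumed MMHM:", best_by_mmhm)
```

In this toy sweep the two selection rules disagree, which is the kind of tradeoff the abstract describes. Under this reading, a harmonic mean is a natural choice because it is dominated by the weakest normalized metric, so a configuration cannot score well by excelling on FID alone.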
