State-of-the-art text-to-image models produce high-quality images, but inference remains expensive because generation requires several sequential ODE or denoising steps. Native one-step models aim to reduce this cost by mapping noise to an image in a single step. Fair comparisons to multi-step systems are difficult, however, because studies use mismatched sampling steps and different classifier-free guidance (CFG) settings, and CFG can shift FID, Inception Score, and CLIP-based alignment in opposing directions. It is also unclear how well one-step models scale to multi-step inference, and there is limited standardized out-of-distribution evaluation for label-ID-conditioned generators beyond ImageNet. To address these gaps, we benchmark eight models spanning one-step flows (MeanFlow, Improved MeanFlow, SoFlow), multi-step baselines (RAE, Scale-RAE), and established systems (SiT, Stable Diffusion 3.5, FLUX.1) under a controlled class-conditional protocol on ImageNet validation, ImageNetV2, and reLAIONet, our new proofread out-of-distribution dataset aligned to ImageNet label IDs. Using FID, Inception Score, CLIP Score, and Pick Score, we show that FID-focused model development and CFG selection can be misleading in few-step regimes, where guidance changes can improve FID while degrading text-image alignment, human preference signals, and perceived quality. We further show that leading one-step models benefit from step scaling and become substantially more competitive under multi-step inference, although they still exhibit characteristic local distortions. To capture these tradeoffs, we introduce MinMax Harmonic Mean (MMHM), a composite proxy over all four metrics that stabilizes hyperparameter selection across guidance and step sweeps.
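The abstract names MMHM but does not spell out its formula, so the following is only a minimal sketch of one plausible reading of the name: min-max normalize each metric across a sweep (flipping FID so that higher is better), then take the harmonic mean of the four normalized values per configuration. The function names, the metric keys, and the `eps` smoothing term are all assumptions made for illustration, not the paper's definition.

```python
from typing import Dict, List

# Assumed metric directions: True means higher is better.
# FID is the only lower-is-better metric among the four.
HIGHER_IS_BETTER: Dict[str, bool] = {
    "fid": False,
    "is": True,
    "clip": True,
    "pick": True,
}


def minmax_normalize(values: List[float], higher_is_better: bool) -> List[float]:
    """Rescale values to [0, 1] across the sweep so that 1 is always best."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The metric does not vary across the sweep; treat every config as equal.
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def harmonic_mean(xs: List[float], eps: float = 1e-8) -> float:
    """Harmonic mean with a small eps (an assumption) to avoid division by zero."""
    return len(xs) / sum(1.0 / (x + eps) for x in xs)


def mmhm_scores(sweep: List[Dict[str, float]]) -> List[float]:
    """Return one composite score per configuration in the sweep.

    `sweep` holds one dict per (guidance, steps) configuration, with the
    keys listed in HIGHER_IS_BETTER.
    """
    normalized = {
        name: minmax_normalize([row[name] for row in sweep], hib)
        for name, hib in HIGHER_IS_BETTER.items()
    }
    return [
        harmonic_mean([normalized[name][i] for name in HIGHER_IS_BETTER])
        for i in range(len(sweep))
    ]
```

Under this reading, a configuration that is best on FID but worst on the other three metrics scores near zero, because the harmonic mean is dominated by its smallest term. That matches the stated motivation of not letting FID alone drive CFG and step selection.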