Recent progress in flow-based generative models and reinforcement learning (RL) has improved text-image alignment and visual quality. However, current RL training for flow models still has two main problems: (i) GRPO-style fixed per-prompt group sizes ignore variation in sampling importance across prompts, which leads to inefficient sampling and slower training; and (ii) trajectory-level advantages are reused as per-step estimates, which biases credit assignment along the flow. We propose SuperFlow, an RL training framework for flow-based models that adjusts group sizes with variance-aware sampling and computes step-level advantages in a way that is consistent with continuous-time flow dynamics. Empirically, SuperFlow reaches promising performance while using only 5.4% to 56.3% of the original training steps and reduces training time by 5.2% to 16.7% without any architectural changes. On standard text-to-image (T2I) tasks, including text rendering, compositional image generation, and human preference alignment, SuperFlow improves over SD3.5-M by 4.6% to 47.2%, and over Flow-GRPO by 1.7% to 16.0%.
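For intuition only, the sketch below shows one way variance-aware group sizing and GRPO-style within-group advantages could be wired together: a fixed sampling budget is split across prompts in proportion to the standard deviation of their past rewards, and advantages are standardized within each prompt's group. The function names, the proportional-to-std allocation rule, and the synthetic reward histories are all assumptions for illustration, not SuperFlow's actual algorithm.

```python
# Illustrative sketch (not the paper's method): allocate a fixed sampling
# budget across prompts in proportion to each prompt's observed reward
# variance, so high-variance prompts receive larger groups.
import numpy as np

def allocate_group_sizes(reward_history, total_budget, min_group=2):
    """Split `total_budget` samples across prompts proportionally to the
    standard deviation of their past rewards (variance-aware sampling)."""
    stds = np.array([np.std(r) if len(r) > 1 else 1.0 for r in reward_history])
    stds = np.maximum(stds, 1e-6)                      # avoid zero weights
    weights = stds / stds.sum()
    return np.maximum(min_group, np.floor(weights * total_budget).astype(int))

def grouped_advantages(rewards):
    """GRPO-style advantage: standardize rewards within each prompt's group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake per-prompt reward histories: prompt 0 is "easy" (low variance),
    # prompt 1 is "hard" (high variance), so it should get a larger group.
    history = [rng.normal(0.9, 0.02, size=8), rng.normal(0.5, 0.3, size=8)]
    sizes = allocate_group_sizes(history, total_budget=24)
    print("group sizes:", sizes)
    for k, n in enumerate(sizes):
        rewards = rng.normal(0.5, 0.2, size=n)
        print(f"prompt {k} advantages:", grouped_advantages(rewards).round(2))
```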