Flow-OPD: On-Policy Distillation for Flow Matching Models

Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models.
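The three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision, together with the MAR anchor, can be illustrated with a toy sketch. The abstract does not give the loss, so everything concrete here is an assumption made for illustration: velocity fields are plain Python callables, the rollout uses Euler integration, supervision is a mean-squared error between student and teacher velocities, and MAR is a weighted extra term of the same form. The names `euler_rollout`, `flow_opd_loss`, and `mar_weight` are hypothetical and are not taken from the paper.

```python
def euler_rollout(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps.

    Returns the list of (state, time) pairs visited before each step.
    """
    x, dt = list(x0), 1.0 / num_steps
    visited = []
    for k in range(num_steps):
        t = k * dt
        visited.append((list(x), t))
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return visited


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def flow_opd_loss(student, teachers, anchor, task, x0,
                  num_steps=4, mar_weight=0.1):
    """Toy on-policy distillation loss (assumed form, not the paper's)."""
    # Step 1, on-policy sampling: states come from the student's own rollout.
    visited = euler_rollout(student, x0, num_steps)
    # Step 2, task routing: pick the specialist teacher for this prompt's task.
    teacher = teachers[task]
    # Step 3, dense supervision: compare velocities at every visited state,
    # rather than scoring only the final sample with a scalar reward.
    distill = sum(mse(student(x, t), teacher(x, t)) for x, t in visited)
    # Assumed MAR term: a task-agnostic teacher supervises the same states.
    anchor_term = sum(mse(student(x, t), anchor(x, t)) for x, t in visited)
    return (distill + mar_weight * anchor_term) / len(visited)


if __name__ == "__main__":
    # Hypothetical linear velocity fields standing in for real networks.
    student = lambda x, t: [-xi for xi in x]
    ocr_teacher = lambda x, t: [1.0 - xi for xi in x]
    print(flow_opd_loss(student, {"ocr": ocr_teacher}, student,
                        "ocr", [0.5, -0.5]))
```

In this sketch the loss is evaluated at every state along the student's trajectory, which is what distinguishes dense supervision from a single scalar reward on the final image. A real implementation would replace the callables with neural networks and backpropagate through the student's velocity predictions.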

