Post-training of flow matching models, i.e., aligning their output distribution with a high-quality target distribution, is mathematically equivalent to imitation learning. While supervised fine-tuning mimics expert demonstrations effectively, it cannot correct policy drift in unseen states. Preference optimization methods address this but require costly preference pairs or reward modeling. We propose Flow Matching Adversarial Imitation Learning (FAIL), which minimizes the divergence between the policy and the expert through adversarial training, without explicit rewards or pairwise comparisons. We derive two algorithms: FAIL-PD exploits differentiable ODE solvers for low-variance pathwise gradients, while FAIL-PG provides a black-box alternative for discrete or computationally constrained settings. Fine-tuning FLUX with only 13,000 demonstrations from Nano Banana pro, FAIL achieves competitive performance on prompt-following and aesthetic benchmarks. Furthermore, the framework generalizes effectively to discrete image and video generation, and serves as a robust regularizer that mitigates reward hacking in reward-based optimization. Code and data are available at https://github.com/HansPolo113/FAIL.
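To make the adversarial objective concrete, the following is a minimal, self-contained sketch (not the released implementation) of FAIL-PD-style training as described above: a discriminator learns to separate expert demonstrations from policy samples, and the flow matching policy is updated with pathwise gradients backpropagated through a differentiable Euler ODE solve. All module architectures, dimensions, step counts, and hyperparameters here are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch of FAIL-PD-style adversarial imitation learning for a flow matching
# policy. Everything below (toy networks, sizes, data) is an assumption for illustration.
import torch
import torch.nn as nn

DIM, STEPS = 16, 8  # toy data dimension and number of Euler steps (assumed)

class VelocityField(nn.Module):
    """Toy flow matching policy v_theta(x, t); stands in for the fine-tuned generator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(DIM + 1, 64), nn.SiLU(), nn.Linear(64, DIM))
    def forward(self, x, t):
        return self.net(torch.cat([x, t.expand(x.size(0), 1)], dim=-1))

def sample(policy, n):
    """Differentiable Euler integration from noise to data; gradients flow through each step."""
    x = torch.randn(n, DIM)
    for k in range(STEPS):
        t = torch.full((1,), k / STEPS)
        x = x + policy(x, t) / STEPS
    return x

policy = VelocityField()
disc = nn.Sequential(nn.Linear(DIM, 64), nn.SiLU(), nn.Linear(64, 1))  # discriminator D
opt_pi = torch.optim.Adam(policy.parameters(), 1e-4)
opt_d = torch.optim.Adam(disc.parameters(), 1e-4)
bce = nn.BCEWithLogitsLoss()
expert_data = torch.randn(1024, DIM) + 2.0  # placeholder for expert demonstrations

for step in range(200):
    expert = expert_data[torch.randint(0, 1024, (64,))]
    # 1) Discriminator: separate expert samples (label 1) from current policy samples (label 0).
    fake = sample(policy, 64).detach()
    d_loss = bce(disc(expert), torch.ones(64, 1)) + bce(disc(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) Policy (pathwise update): push generated samples toward the expert side of D by
    #    backpropagating through the ODE solve itself, with no explicit reward model.
    g_loss = bce(disc(sample(policy, 64)), torch.ones(64, 1))
    opt_pi.zero_grad(); g_loss.backward(); opt_pi.step()
```

A FAIL-PG-style variant would replace the pathwise policy update with a score-function (REINFORCE-style) estimate of the same objective, which avoids differentiating through the solver at the cost of higher-variance gradients.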