Reinforcement Fine-Tuning (RFT) on flow-based models is crucial for preference alignment. However, RFT methods often introduce visual hallucinations such as over-optimized details and semantic misalignment. This work presents a preliminary exploration of why visual hallucinations arise and how to reduce them. We first investigate RFT methods from a unified perspective and reveal that the core problems stem from two aspects, exploration and exploitation: (1) limited exploration during stochastic differential equation (SDE) rollouts, which leads to an over-emphasis on local details at the expense of global semantics, and (2) the trajectory-imitation process inherent in policy gradient methods, which distorts the model's foundational vector field and its cross-step consistency. Building on this analysis, we propose ConsistentRFT, a general framework for mitigating these hallucinations. Specifically, we design a Dynamic Granularity Rollout (DGR) mechanism that balances exploration between global semantics and local details by dynamically scheduling different noise sources. We then introduce Consistent Policy Gradient Optimization (CPGO), which preserves the model's consistency by aligning the current policy with a more stable prior. Extensive experiments demonstrate that ConsistentRFT significantly mitigates visual hallucinations, achieving average reductions of 49\% for low-level and 38\% for high-level perceptual hallucinations. Furthermore, ConsistentRFT outperforms other RFT methods on out-of-domain metrics, improving over FLUX1.dev by 5.1\% (vs. the baseline's -0.4\%). Project page: \href{https://xiaofeng-tan.github.io/projects/ConsistentRFT}{ConsistentRFT}.
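To make the two components concrete, the following is a minimal toy sketch of the ideas as described above, not the paper's actual implementation: an Euler-Maruyama SDE rollout whose injected-noise scale is scheduled across steps (coarse exploration early, fine detail late, our reading of DGR), and a policy-gradient loss with a consistency penalty anchoring the current vector field to a stable prior (our reading of CPGO). All function names, the linear schedule, and the squared-error penalty are illustrative assumptions.

```python
import numpy as np

def dgr_noise_schedule(t, total_steps, sigma_hi=1.0, sigma_lo=0.1):
    # Hypothetical granularity schedule: strong noise early (global
    # semantics), weak noise late (local details), linearly interpolated.
    frac = t / max(total_steps - 1, 1)
    return sigma_hi * (1 - frac) + sigma_lo * frac

def sde_rollout(x0, velocity_fn, total_steps=10, seed=0):
    # Toy Euler-Maruyama rollout: follow the learned vector field and
    # inject scheduled Gaussian noise at each step.
    rng = np.random.default_rng(seed)
    x = x0.copy()
    dt = 1.0 / total_steps
    for t in range(total_steps):
        v = velocity_fn(x, t / total_steps)          # learned vector field
        sigma = dgr_noise_schedule(t, total_steps)   # scheduled noise scale
        x = x + v * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

def cpgo_style_loss(log_prob, advantage, v_theta, v_prior, lam=0.1):
    # Hypothetical consistency-regularized objective: a standard policy
    # gradient term plus a penalty keeping the current vector field
    # close to a more stable prior one.
    pg = -(advantage * log_prob).mean()
    consistency = ((v_theta - v_prior) ** 2).mean()
    return pg + lam * consistency

# Usage with a dummy vector field that pulls samples toward the origin.
sample = sde_rollout(np.ones(4), lambda x, t: -x, total_steps=20)
```

The schedule direction (high-to-low noise) is one plausible choice for trading global structure against local detail; the paper's actual scheduling of noise sources may differ.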