Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, thereby misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data-curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing on both execution and consistency, while generation is assessed primarily on instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark designed specifically for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared with existing metrics. Furthermore, to integrate these critics seamlessly into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models, FIRM-Qwen-Edit and FIRM-SD3.5, achieve substantial performance gains. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, setting a new standard for fidelity and instruction adherence over existing general-purpose models. All of our datasets, models, and code are publicly available at https://firm-reward.github.io.
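To make the "Base-and-Bonus" idea concrete, the sketch below shows one possible way such a reward could be wired into an RL loop. The abstract does not give the exact CME/QMA formulas, so the gating rule, the threshold, the score ranges, and the choice of which criterion serves as base versus bonus are all illustrative assumptions, not the paper's method.

```python
# A minimal, hypothetical sketch of a "Base-and-Bonus" reward.
# Assumptions (not specified in the abstract): all scores lie in [0, 1],
# and "X-Modulated Y" is read as "Y's reward is a bonus gated by X".

def base_and_bonus(base: float, bonus: float, gate: float = 0.5) -> float:
    """Return the base reward, paying out the bonus only once the base
    criterion is sufficiently satisfied (hypothetical gating rule)."""
    return base + (bonus if base >= gate else 0.0)

def cme_reward(exec_score: float, cons_score: float) -> float:
    """Consistency-Modulated Execution (editing, one possible reading):
    consistency with the source image is the base, and the execution
    score is granted as a bonus only when consistency holds, so the
    policy cannot trade away faithfulness for edit strength."""
    return base_and_bonus(base=cons_score, bonus=exec_score)

def qma_reward(align_score: float, quality_score: float) -> float:
    """Quality-Modulated Alignment (generation, one possible reading):
    image quality is the base, and prompt alignment is granted as a
    bonus only when quality holds, so alignment cannot be bought with
    degraded images."""
    return base_and_bonus(base=quality_score, bonus=align_score)

if __name__ == "__main__":
    print(cme_reward(exec_score=0.9, cons_score=0.8))  # 0.8 + 0.9 = 1.7
    print(cme_reward(exec_score=0.9, cons_score=0.3))  # bonus withheld: 0.3
```

Under this reading, the gated structure is what balances the competing objectives: the policy can only collect the primary reward (execution or alignment) after the safeguarding criterion (consistency or quality) is met.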