Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GRPO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation, as the simple linear aggregation of rewards with static weights causes conflicting gradient signals and a misalignment with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicted reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient; and (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model's temporal dynamics by prioritizing prompt-following in the early denoising steps and identity preservation in the later ones. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
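The two mechanisms above can be illustrated with a minimal sketch. The exact formulas are not given in the abstract, so every function name, schedule, and coefficient below is a hypothetical placeholder: `r_id` and `r_prompt` are assumed to be group-normalized (zero-centered) identity and prompt-adherence rewards, `t` is the diffusion timestep with `t = T` being the noisiest (earliest) denoising step, and the synergy/conflict terms stand in for whatever non-linear shaping the paper actually uses.

```python
def time_aware_weights(t: int, T: int) -> tuple[float, float]:
    """Hypothetical TDW schedule: early denoising steps (t near T, high
    noise) weight prompt-following; later steps (t near 0) weight
    identity preservation."""
    w_prompt = t / T
    w_id = 1.0 - w_prompt
    return w_id, w_prompt

def shaped_reward(r_id: float, r_prompt: float, t: int, T: int,
                  synergy_bonus: float = 0.5,
                  conflict_penalty: float = 0.5) -> float:
    """Hypothetical SARS: start from the time-weighted linear mix, then
    amplify when both centered rewards are positive (synergy) and
    penalize when they disagree in sign (conflict)."""
    w_id, w_prompt = time_aware_weights(t, T)
    base = w_id * r_id + w_prompt * r_prompt
    if r_id > 0 and r_prompt > 0:      # synergistic sample: boost it
        return base + synergy_bonus * min(r_id, r_prompt)
    if r_id * r_prompt < 0:            # conflicted sample: dampen it
        return base - conflict_penalty * abs(r_id - r_prompt)
    return base
```

Compared with a static linear aggregation, a shaping of this kind gives conflicted samples a strictly worse advantage than synergistic ones even when their linear mixes are equal, which is the sharper gradient signal the abstract refers to.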