Subject-driven image generation models face a fundamental trade-off between identity preservation (fidelity) and prompt adherence (editability). While online reinforcement learning (RL), specifically GRPO, offers a promising solution, we find that a naive application of GRPO leads to competitive degradation: the simple linear aggregation of rewards with static weights produces conflicting gradient signals and is misaligned with the temporal dynamics of the diffusion process. To overcome these limitations, we propose Customized-GRPO, a novel framework featuring two key innovations: (i) Synergy-Aware Reward Shaping (SARS), a non-linear mechanism that explicitly penalizes conflicting reward signals and amplifies synergistic ones, providing a sharper and more decisive gradient; and (ii) Time-Aware Dynamic Weighting (TDW), which aligns the optimization pressure with the model's temporal dynamics by prioritizing prompt-following in the early stages and identity preservation in the later stages. Extensive experiments demonstrate that our method significantly outperforms naive GRPO baselines, successfully mitigating competitive degradation. Our model achieves a superior balance, generating images that both preserve key identity features and accurately adhere to complex textual prompts.
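The abstract does not state the exact formulas behind SARS or TDW, so the following is only an illustrative sketch of the ideas it describes, not the authors' implementation. The group-relative standardization follows the usual GRPO recipe; the linear weight schedule in `time_aware_weights` and the sign-agreement rule with `boost` and `damp` factors in `synergy_shaped_advantage` are hypothetical choices made here for illustration, as are all function and parameter names.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def naive_reward(r_id, r_text, w_id=0.5, w_text=0.5):
    """Static linear aggregation, i.e. the baseline the abstract criticizes."""
    return w_id * r_id + w_text * r_text


def time_aware_weights(step, num_steps):
    """Hypothetical schedule (assumption): the prompt-following weight falls
    linearly from 1 to 0 over the denoising steps while the identity weight
    rises from 0 to 1."""
    progress = step / max(num_steps - 1, 1)
    w_id = progress
    w_text = 1.0 - progress
    return w_id, w_text


def synergy_shaped_advantage(a_id, a_text, w_id, w_text, boost=1.5, damp=0.5):
    """Hypothetical non-linear shaping (assumption): amplify the combined
    advantage when both signals agree in sign, shrink it when they conflict."""
    combined = w_id * a_id + w_text * a_text
    agree = a_id * a_text > 0
    return combined * (boost if agree else damp)


# Toy group of four samples with made-up identity and prompt rewards.
r_id = [0.9, 0.2, 0.8, 0.3]
r_text = [0.2, 0.9, 0.8, 0.3]

# Naive aggregation: samples 0 and 1 collapse to the same scalar reward
# even though they fail in opposite ways.
naive = group_advantages([naive_reward(i, t) for i, t in zip(r_id, r_text)])

# Shaped variant: standardize each reward separately, then combine per step.
a_id = group_advantages(r_id)
a_text = group_advantages(r_text)
w_id, w_text = time_aware_weights(step=0, num_steps=10)
shaped = [synergy_shaped_advantage(i, t, w_id, w_text) for i, t in zip(a_id, a_text)]
```

In this toy group, the first two samples trade identity against prompt adherence in opposite directions, yet static equal weights give them the same aggregated reward, so the naive advantage cannot tell them apart. With per-reward advantages and step-dependent weights, the same two samples receive different signals at early and late steps, which is the kind of behavior the abstract attributes to TDW.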