STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it with equal strength across the entire generative trajectory. However, text-to-image generation has inherent temporal and spatial structure: different denoising steps govern different stages of generation, and the content that determines text alignment often occupies only part of the image. This granularity mismatch makes it difficult for policy updates to focus on the generative components that actually affect the reward. To address this issue, we propose \textbf{SpatioTemporal Adaptive Reward (STAR) Allocation} for RL post-training of text-to-image diffusion and flow models. Starting from the core content of the prompt that the user cares about, STAR uses the text-image attention inside the generative model to construct spatial allocation maps that vary dynamically across denoising steps and rollouts. With almost no additional computational overhead, it allocates the same group-relative advantage preferentially to the more relevant latent regions, and then applies stronger policy updates to these regions through a spatially resolved policy objective. We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving $\mathbf{0.9759}$, $\mathbf{0.9757}$, and $\mathbf{23.60}$ on GenEval, OCR, and PickScore, respectively.
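
The allocation idea can be illustrated with a toy sketch. This is not the paper's formulation, which the abstract does not spell out. It assumes a GRPO-style standardization of rewards within a rollout group and a mean-one normalization of attention scores, so that the average advantage per latent position is unchanged. All function names, the flat list of four latent positions, and the numeric values are hypothetical; in practice the attention scores would be re-read from the model at every denoising step and for every rollout.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize final-image rewards within one prompt's rollout group.

    Assumption: GRPO-style (r - mean) / std; the abstract only says
    'group-relative advantage'.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def allocation_map(attention):
    """Rescale non-negative attention scores into weights with mean 1.

    Assumption: mean-one normalization, chosen here so that spatial
    reweighting does not change the average update strength.
    """
    n = len(attention)
    total = sum(attention)
    if total <= 0:
        # Fall back to uniform allocation when attention carries no signal.
        return [1.0] * n
    return [a * n / total for a in attention]


def spatial_advantages(advantage, attention):
    """Spread one rollout's scalar advantage over latent positions."""
    return [advantage * w for w in allocation_map(attention)]


# Toy group of four rollouts for a single prompt (made-up rewards).
rewards = [0.2, 0.9, 0.5, 0.4]
advs = group_relative_advantages(rewards)

# Toy text-image attention over four latent positions for rollout 1
# at a single denoising step (made-up values).
attention = [0.1, 0.1, 0.6, 0.2]
per_position = spatial_advantages(advs[1], attention)
```

Under these assumptions, the position with the highest attention receives the largest share of the advantage, while the mean over positions still equals the original scalar advantage.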
