Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

Recently, post-training methods based on reinforcement learning, particularly Group Relative Policy Optimization (GRPO), have emerged as a robust paradigm for further advancing text-to-image (T2I) models. However, these methods are prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than achieving genuine performance gains. In this work, we show that normalization can lead to miscalibration, and that directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signals from noise. To address these issues, we propose Super-Linear Advantage Shaping (SLAS), which revisits the functional update from an information geometry perspective. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening them in low-advantage regions to suppress illusory gradients. In addition, batch-level normalization is applied to stabilize training under varying reward scales. Extensive evaluations demonstrate that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and enhanced robustness to model scaling, while mitigating reward hacking and preserving the semantic and compositional fidelity of generations.
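As a rough illustration of the three advantage schemes contrasted above, the sketch below compares prompt-level standard-deviation normalization (as in GRPO), mean-centering alone (the linear case), and a super-linear shaping with batch-level normalization. The abstract does not state the SLAS formula, so the signed power map `sign(z) * |z|**power`, the `power` parameter, and all function names are hypothetical stand-ins chosen only to show the qualitative effect. They are not the authors' method.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: center and scale by the per-prompt (group) std.

    rewards has shape (num_prompts, samples_per_prompt).
    """
    r = np.asarray(rewards, dtype=np.float64)
    mean = r.mean(axis=1, keepdims=True)
    std = r.std(axis=1, keepdims=True)
    return (r - mean) / (std + eps)


def centered_advantages(rewards):
    """Drop the prompt-level std term: advantages stay linear in the reward gap."""
    r = np.asarray(rewards, dtype=np.float64)
    return r - r.mean(axis=1, keepdims=True)


def super_linear_advantages(rewards, power=2.0, eps=1e-8):
    """ASSUMPTION: a signed power map stands in for the super-linear shaping.

    Advantages are centered per prompt, scaled by one batch-level std,
    then passed through sign(z) * |z|**power (power > 1 is super-linear).
    """
    a = centered_advantages(rewards)
    z = a / (a.std() + eps)  # batch-level normalization (single scalar scale)
    return np.sign(z) * np.abs(z) ** power


# Toy batch: prompt 0 has near-identical rewards (mostly noise),
# prompt 1 has clearly separated rewards (genuine signal).
rewards = np.array([[0.50, 0.51, 0.49, 0.50],
                    [0.10, 0.90, 0.20, 0.80]])

for name, fn in [("grpo", grpo_advantages),
                 ("centered", centered_advantages),
                 ("super_linear", super_linear_advantages)]:
    adv = np.abs(fn(rewards))
    print(name, "signal/noise max-advantage ratio:", adv[1].max() / adv[0].max())
```

In this toy batch, per-prompt std scaling gives the noisy prompt advantages as large as those of the informative prompt, centering alone preserves the raw reward gap, and the assumed power map widens that gap further. This is the qualitative behavior the abstract attributes to relaxing constraints along high-advantage directions.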
