Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, so generators optimized against them often produce images that look plausible overall yet position objects inaccurately. In this work, we present \textbf{SpatialReward}, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a \emph{Prompt Decomposer} extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over the grounded observations to assess complex spatial relations that are challenging for rule-based methods. To evaluate spatial relationships in generated images more comprehensively, we introduce \textbf{SpatRelBench}, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, and yields results that align more closely with human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization of text-to-image generation models.
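To make the idea of a verifiable spatial reward concrete, the following is a minimal sketch of the simplest case only: checking decomposed relations against detector bounding boxes with fixed rules. It is not the SpatialReward implementation. The names `Box`, `Relation`, `relation_holds`, and `spatial_reward`, the four supported predicates, and the center-point comparison are all assumptions made for illustration; the abstract states that complex relations beyond such rules are handled by a vision-language model, which this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates; y grows downward (assumed convention)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


@dataclass(frozen=True)
class Relation:
    """One spatial constraint, as a hypothetical prompt decomposer might emit it."""
    subject: str
    predicate: str  # one of: "left of", "right of", "above", "below"
    obj: str


def relation_holds(rel, detections):
    """Return True if the relation is satisfied by the detected box centers."""
    # A missing entity means the constraint cannot be verified, so it fails.
    if rel.subject not in detections or rel.obj not in detections:
        return False
    sx, sy = detections[rel.subject].center
    ox, oy = detections[rel.obj].center
    checks = {
        "left of": sx < ox,
        "right of": sx > ox,
        "above": sy < oy,  # smaller y is higher in the image
        "below": sy > oy,
    }
    if rel.predicate not in checks:
        raise ValueError(f"unsupported predicate: {rel.predicate!r}")
    return checks[rel.predicate]


def spatial_reward(relations, detections):
    """Fraction of relations satisfied; 0.0 when there is nothing to verify."""
    if not relations:
        return 0.0
    return sum(relation_holds(r, detections) for r in relations) / len(relations)


# Hypothetical decomposition of "a cat to the left of a dog, with a bird above the dog".
relations = [
    Relation("cat", "left of", "dog"),
    Relation("bird", "above", "dog"),
]
# Hypothetical detector output: the bird was drawn below the dog.
detections = {
    "cat": Box(10, 100, 60, 150),
    "dog": Box(120, 100, 180, 160),
    "bird": Box(130, 200, 160, 230),
}
reward = spatial_reward(relations, detections)  # one of two relations holds
```

In this example the cat is correctly placed and the bird is not, so the reward is 0.5. Because each term is a deterministic check on grounded boxes, the score can be audited relation by relation, which is the property that "verifiable" refers to here.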