Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also raised expectations for handling complex prompts, particularly those that encode intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset, which contains over 80k preference pairs. Building on this dataset, we develop SpatialScore, a reward model that evaluates the accuracy of spatial relationships in text-to-image generation and surpasses even leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.