Reward models for text-to-video (T2V) generation guide post-training but often fail at fine-grained semantic alignment. We trace this to two structural weaknesses in existing reasoning-based reward models: they do not systematically verify every condition described in the prompt, and the visual evidence supporting each judgment remains implicit in their free-form reasoning. We propose SG-PVR, a video reward model that addresses both limitations through plan-and-verify reasoning grounded in spatio-temporal scene graphs. The verification plan decomposes the prompt into atomic claims, so that every requirement is checked. The spatio-temporal scene graph, which encodes entities, attributes, and temporally grounded relations, is extracted from the video and maintained as a persistent, structured visual reference throughout reasoning. Each claim is verified against both the video and the scene graph, anchoring judgments in explicit visual evidence. SG-PVR achieves strong performance on semantic alignment, including fine-grained temporal semantics. As a test-time reranker, it further enhances compositional alignment in T2V generation.
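To make the plan-and-verify idea concrete, the following is a minimal illustrative sketch, not the SG-PVR implementation. It assumes a hand-written verification plan and a hand-written scene graph, and it checks claims by symbolic lookup in the graph alone, whereas the model described above reasons over both the video and the graph. All names (`Claim`, `SceneGraph`, `Relation`, `verify`, `score`, `rerank`), the four claim kinds, and the "fraction of satisfied claims" score are hypothetical choices made for illustration.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A relation between two entities, grounded in a time interval (seconds)."""
    subject: str
    predicate: str
    obj: str
    start: float
    end: float


@dataclass
class SceneGraph:
    """Toy spatio-temporal scene graph: entities with attributes, plus timed relations."""
    attributes: dict[str, set[str]] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)


@dataclass(frozen=True)
class Claim:
    """One atomic claim from the verification plan (claim kinds are an assumption)."""
    kind: str   # "entity", "attribute", "relation", or "before"
    args: tuple


def first_start(graph: SceneGraph, triple: tuple) -> float | None:
    """Earliest start time of a matching relation, or None if it never occurs."""
    starts = [r.start for r in graph.relations
              if (r.subject, r.predicate, r.obj) == triple]
    return min(starts) if starts else None


def verify(claim: Claim, graph: SceneGraph) -> bool:
    """Check a single claim against the scene graph only (no video access here)."""
    if claim.kind == "entity":
        return claim.args[0] in graph.attributes
    if claim.kind == "attribute":
        entity, attr = claim.args
        return attr in graph.attributes.get(entity, set())
    if claim.kind == "relation":
        return first_start(graph, claim.args) is not None
    if claim.kind == "before":
        a = first_start(graph, claim.args[0])
        b = first_start(graph, claim.args[1])
        return a is not None and b is not None and a < b
    raise ValueError(f"unknown claim kind: {claim.kind}")


def score(plan: list[Claim], graph: SceneGraph) -> float:
    """Assumed reward: fraction of plan claims that the graph supports."""
    return sum(verify(c, graph) for c in plan) / len(plan) if plan else 0.0


def rerank(plan: list[Claim], candidates: dict[str, SceneGraph]) -> list[str]:
    """Test-time reranking: order candidate videos by descending score."""
    return sorted(candidates, key=lambda k: score(plan, candidates[k]), reverse=True)


# Hypothetical prompt: "A red ball rolls off a table, then a dog picks it up."
ROLLS = ("ball", "rolls off", "table")
PICKS = ("dog", "picks up", "ball")
plan = [
    Claim("entity", ("ball",)),
    Claim("attribute", ("ball", "red")),
    Claim("entity", ("dog",)),
    Claim("relation", ROLLS),
    Claim("relation", PICKS),
    Claim("before", (ROLLS, PICKS)),
]
good = SceneGraph(
    attributes={"ball": {"red"}, "dog": set(), "table": set()},
    relations=[Relation(*ROLLS, 0.0, 1.5), Relation(*PICKS, 2.0, 3.0)],
)
bad = SceneGraph(  # wrong colour and reversed event order
    attributes={"ball": {"blue"}, "dog": set(), "table": set()},
    relations=[Relation(*PICKS, 0.0, 1.0), Relation(*ROLLS, 2.0, 3.0)],
)
ranking = rerank(plan, {"bad": bad, "good": good})
```

Because every claim in the plan is checked individually, the per-claim results also indicate which requirement a candidate video fails (here, the colour attribute and the temporal ordering), which is the kind of explicit evidence that free-form reasoning tends to leave implicit.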