Recent advances in video reward models and post-training strategies have improved text-to-video (T2V) generation. While these models typically assess visual quality, motion quality, and text alignment, they often overlook key structural distortions, such as abnormal object appearances and interactions, which can degrade the overall quality of generated videos. To address this gap, we introduce REACT, a frame-level reward model designed specifically for evaluating structural distortions in generated videos. REACT reasons over video frames to assign point-wise scores and attribution labels, with a focus on recognizing distortions. To support this, we construct a large-scale human preference dataset, annotated according to our proposed taxonomy of structural distortions, and generate additional data using an efficient Chain-of-Thought (CoT) synthesis pipeline. REACT is trained with a two-stage framework: (1) supervised fine-tuning with masked loss for domain knowledge injection, followed by (2) reinforcement learning with Group Relative Policy Optimization (GRPO) and pairwise rewards to enhance reasoning capability and align output scores with human preferences. During inference, a dynamic sampling mechanism focuses on the frames most likely to exhibit distortion. We also present REACT-Bench, a benchmark for generative video distortion evaluation. Experimental results demonstrate that REACT complements existing reward models in assessing structural distortion, achieving both accurate quantitative evaluations and interpretable attribution analysis.
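As a rough illustration of the second training stage, the sketch below shows how a pairwise reward could feed the group-relative advantage used in GRPO, where each sampled response's reward is standardized against the other responses in its group. The abstract does not specify the reward definition, so `pairwise_reward` (1.0 when the human-preferred video receives the higher score) and the example scores are assumptions made for illustration, not REACT's actual formulation. The use of the population standard deviation is likewise a choice made here.

```python
from statistics import mean, pstdev


def pairwise_reward(score_preferred, score_rejected):
    """Hypothetical reward: 1.0 if the human-preferred video is scored higher."""
    return 1.0 if score_preferred > score_rejected else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four sampled responses. Each response yields a pair of scores:
# (score for the human-preferred video, score for the rejected video).
# These numbers are made up for the example.
sampled_scores = [(4.0, 2.0), (3.0, 3.5), (5.0, 1.0), (2.0, 2.0)]

rewards = [pairwise_reward(p, q) for p, q in sampled_scores]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.3f}")
```

Responses that order the pair consistently with the human preference receive a positive advantage and the others a negative one, so the policy update pushes scores toward the human ranking without requiring an absolute target score.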