Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which are inefficient and often yield ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that collects preference data efficiently, generating a preference pair with a single inference per prompt and eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to the corrupted areas, enabling rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence, and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.
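To make the region-aware objective concrete, the sketch below shows one plausible way to restrict a Diffusion-DPO-style loss to the corrupted spatio-temporal region. It is an illustrative assumption, not the paper's actual implementation: the function name `region_aware_dpo_loss`, the tensor shapes, and the `beta` value are hypothetical, and the loss form follows the standard Diffusion-DPO denoising-error comparison between the trained policy and the frozen reference model.

```python
import torch
import torch.nn.functional as F

def region_aware_dpo_loss(eps_theta_w, eps_theta_l, eps_ref_w, eps_ref_l,
                          noise_w, noise_l, mask, beta=500.0):
    """Hypothetical region-aware Diffusion-DPO loss (a sketch, not the paper's code).

    All noise-prediction tensors have shape (B, C, T, H, W); `mask` is 1 inside the
    corrupted spatio-temporal region and 0 elsewhere, so preference learning is
    confined to the regions where the positive (real) video and the negative
    (locally regenerated) video actually differ.
    """
    def masked_mse(pred, target):
        # Mean squared denoising error computed only over masked voxels.
        err = ((pred - target) ** 2) * mask
        return err.flatten(1).sum(dim=1) / mask.flatten(1).sum(dim=1).clamp(min=1.0)

    # Per-sample denoising errors for the policy and the frozen reference model.
    w_theta, l_theta = masked_mse(eps_theta_w, noise_w), masked_mse(eps_theta_l, noise_l)
    w_ref, l_ref = masked_mse(eps_ref_w, noise_w), masked_mse(eps_ref_l, noise_l)

    # Diffusion-DPO logits: the policy should denoise the preferred (real) video
    # better, and the dispreferred (corrupted-and-restored) video worse, than the
    # reference does -- but only inside the masked region.
    logits = -beta * ((w_theta - w_ref) - (l_theta - l_ref))
    return -F.logsigmoid(logits).mean()
```

Restricting the squared denoising error to the mask is what turns the usual global preference signal into localized supervision; outside the mask the positive and negative videos are identical, so including those voxels would only dilute the gradient.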