OnlineVPO: Align Video Diffusion Model with Online Video-Centric Preference Optimization

Video diffusion models (VDMs) have demonstrated remarkable capabilities in text-to-video (T2V) generation. Despite this success, VDMs still suffer from degraded image quality and flickering artifacts. To address these issues, some approaches have introduced preference learning, which exploits human feedback to improve video generation. However, these methods largely reuse routines from the image domain without an in-depth investigation of video-specific preference optimization. In this paper, we reexamine the design of video preference learning from two key aspects, the feedback source and the feedback tuning methodology, and present OnlineVPO, a more efficient preference learning framework tailored specifically to VDMs. Regarding the feedback source, we find that the image-level reward models commonly used in existing methods fail to provide a human-aligned video preference signal because of the modality gap. In contrast, video quality assessment (VQA) models show superior alignment with human perception of video quality. Building on this insight, we propose leveraging VQA models as a proxy for humans to provide more modality-aligned feedback for VDMs. Regarding the preference tuning methodology, we introduce an online direct preference optimization (DPO) algorithm tailored to VDMs. It not only scales better than existing approaches when optimizing videos with higher resolution and longer duration, but also mitigates the insufficient optimization caused by off-policy learning through online preference generation and curriculum preference updates. Extensive experiments on an open-source video diffusion model demonstrate that OnlineVPO is a simple yet effective and, more importantly, scalable preference learning algorithm for video diffusion models.
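To make the idea of online preference generation concrete, the following is a minimal sketch under stated assumptions, not the implementation used in this work. It assumes that several candidates are sampled from the current policy for each prompt, that a scoring function standing in for a VQA model ranks them, and that the best and worst candidates form the preference pair. The loss shown is the standard DPO objective written over scalar log-probabilities; the abstract does not specify the exact diffusion-specific form, and the curriculum preference update is not modeled here. The names `select_preference_pair`, `dpo_loss`, and `score_fn` are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_preference_pair(
    candidates: Sequence[T], score_fn: Callable[[T], float]
) -> Tuple[T, T]:
    """Return (preferred, rejected) from freshly sampled candidates.

    `score_fn` is a stand-in for a VQA model (an assumption of this sketch):
    a higher score means the candidate is judged to be of higher quality.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[0], ranked[-1]


def dpo_loss(
    logp_w: float,
    logp_l: float,
    ref_logp_w: float,
    ref_logp_l: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_* are log-probabilities under the policy being tuned and ref_logp_*
    under the frozen reference model. In a diffusion setting these would be
    replaced by a denoising-based surrogate, which is not shown here.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin); computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Because the pair is rebuilt from the current policy's own samples at every step, the preference data stays on-policy, which is the property the abstract credits with mitigating insufficient optimization.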
