Large text-to-video models hold immense potential for a wide range of downstream applications. However, these models struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution, inspired by large language models, is to align generated outputs with desired outcomes using external feedback. This enables the model to refine its outputs autonomously, eliminating the need for extensive manual data collection. In this work, we investigate the use of feedback to enhance the object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively improve text-video alignment and realistic object interactions? We begin by deriving a unified probabilistic objective for offline RL finetuning of text-to-video models. This perspective highlights how design elements of existing algorithms, such as KL regularization and policy projection, emerge as specific choices within a unified framework. We then use the derived methods to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), but find that they often fail to align with human perceptions of generation quality. To address this limitation, we propose leveraging vision-language models to provide more nuanced feedback specifically tailored to object dynamics in videos. Our experiments demonstrate that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions, as confirmed by both AI and human evaluations. Notably, we observe substantial gains when using reward signals derived from AI feedback, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.