Large text-to-video models hold immense potential for a wide range of downstream applications. However, these models struggle to accurately depict dynamic object interactions, often resulting in unrealistic movements and frequent violations of real-world physics. One solution, inspired by large language models, is to align generated outputs with desired outcomes using external feedback. This enables the model to refine its outputs autonomously, eliminating the need for extensive manual data collection. In this work, we investigate the use of feedback to enhance the object dynamics in text-to-video models. We aim to answer a critical question: what types of feedback, paired with which specific self-improvement algorithms, can most effectively improve text-video alignment and realistic object interactions? We begin by deriving a unified probabilistic objective for offline RL finetuning of text-to-video models. This perspective highlights how design elements of existing algorithms, such as KL regularization and policy projection, emerge as specific choices within a unified framework. We then use the derived methods to optimize a set of text-video alignment metrics (e.g., CLIP scores, optical flow), but find that they often fail to align with human perceptions of generation quality. To address this limitation, we propose leveraging vision-language models to provide more nuanced feedback specifically tailored to object dynamics in videos. Our experiments demonstrate that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions, as confirmed by both AI and human evaluations. Notably, we observe substantial gains when using reward signals derived from AI feedback, particularly in scenarios involving complex interactions between multiple objects and realistic depictions of objects falling.