Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs, where the textual prompts play a pivotal role in determining the quality of output videos. However, achieving the desired output often entails multiple revisions and iterative inference to refine user-provided prompts. Current automatic methods for refining prompts encounter challenges such as Modality-Inconsistency, Cost-Discrepancy, and Model-Unaware when applied to text-to-video diffusion models. To address these problems, we introduce an LLM-based prompt adaptation framework, termed Prompt-A-Video, which excels in crafting Video-Centric, Labor-Free and Preference-Aligned prompts tailored to a specific video diffusion model. Our approach involves a meticulously crafted two-stage optimization and alignment system. Initially, we conduct a reward-guided prompt evolution pipeline to automatically create a pool of optimal prompts and leverage them for supervised fine-tuning (SFT) of the LLM. Then, multi-dimensional rewards are employed to generate pairwise data for the SFT model, followed by the direct preference optimization (DPO) algorithm to further facilitate preference alignment. Through extensive experimentation and comparative analyses, we validate the effectiveness of Prompt-A-Video across diverse generation models, highlighting its potential to push the boundaries of video generation.
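To make the two-stage system concrete, the following is a minimal Python sketch of the pipeline as described: reward-guided prompt evolution producing SFT data, then reward-ranked pairwise data for DPO. All names here (`llm_refine`, `multi_dim_reward`, the search hyperparameters) are hypothetical placeholders standing in for the LLM, the target video diffusion model, and the reward models; this is an illustrative outline under those assumptions, not the paper's implementation.

```python
# Illustrative sketch of a two-stage prompt optimization pipeline.
# All model calls below are hypothetical stubs, not the paper's actual API.
from dataclasses import dataclass
import random

@dataclass
class Candidate:
    prompt: str
    score: float

def llm_refine(prompt: str, n: int) -> list[str]:
    """Hypothetical: sample n refined variants of a prompt from an LLM."""
    return [f"{prompt} (variant {i})" for i in range(n)]

def multi_dim_reward(prompt: str) -> float:
    """Hypothetical: render a video with the target diffusion model and
    aggregate multi-dimensional reward scores (e.g., visual quality and
    text-video alignment) into a single scalar."""
    return random.random()

# Stage 1: reward-guided prompt evolution. Iteratively refine each user
# prompt, keep the highest-reward candidate, and collect
# (user prompt, best refined prompt) pairs for supervised fine-tuning.
def evolve(user_prompt: str, rounds: int = 3, width: int = 4) -> Candidate:
    best = Candidate(user_prompt, multi_dim_reward(user_prompt))
    for _ in range(rounds):
        for cand in llm_refine(best.prompt, width):
            score = multi_dim_reward(cand)
            if score > best.score:
                best = Candidate(cand, score)
    return best

def build_sft_pool(user_prompts: list[str]) -> list[tuple[str, str]]:
    return [(p, evolve(p).prompt) for p in user_prompts]

# Stage 2: pairwise preference data for DPO. Sample several refinements
# from the SFT model per input, score them, and pair the highest- and
# lowest-reward refinements as (chosen, rejected).
def build_dpo_pairs(user_prompts: list[str], n: int = 4) -> list[tuple[str, str, str]]:
    pairs = []
    for p in user_prompts:
        scored = sorted(
            (Candidate(c, multi_dim_reward(c)) for c in llm_refine(p, n)),
            key=lambda c: c.score,
        )
        pairs.append((p, scored[-1].prompt, scored[0].prompt))
    return pairs
```

The (input, chosen, rejected) triples from the second stage are the standard input format for DPO-style preference alignment; the reward aggregation and search strategy shown here are simplified assumptions.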