Remarkable progress in text-to-video diffusion models has enabled the generation of photorealistic videos, yet the generated content often exhibits unnatural movement or deformation, reverse playback, and motionless scenes. Recently, the alignment problem, in which the output of diffusion models is steered according to some measure of content quality, has attracted considerable attention. Because there is substantial room to improve perceptual quality across frames, two questions arise for video generation: which metrics should be optimized, and how can they be optimized? In this paper, we propose diffusion latent beam search with a lookahead estimator, which selects better diffusion latents at inference time to maximize a given alignment reward. We then point out that improving perceptual video quality with respect to prompt alignment requires reward calibration, that is, weighting existing metrics, because many previous metrics for quantifying video naturalness do not always correlate with evaluations by humans or vision-language models (VLMs). We demonstrate that our method improves perceptual quality as evaluated by the calibrated reward, VLMs, and human assessment, without updating model parameters, and that it produces the best generations compared with greedy search and best-of-N sampling at a much lower computational cost. The experiments show that our method benefits many capable generative models, and they yield a practical guideline: inference-time compute should be allocated first to enabling the lookahead estimator and increasing the search budget, rather than to increasing the number of denoising steps.
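
To make the search procedure concrete, the following is a minimal toy sketch of beam search over diffusion latents with a lookahead step and a weighted (calibrated) reward. It is not the paper's implementation: the denoiser `toy_denoise_step`, the two metrics inside `calibrated_reward`, the weights, and all parameter names are hypothetical stand-ins chosen so the example runs with the Python standard library alone. A latent here is just a list with one scalar per frame.

```python
import random


def toy_denoise_step(latent, rng=None, noise_scale=0.0):
    """Hypothetical stand-in for one reverse-diffusion step.

    Shrinks every frame value toward zero. When an rng is given, Gaussian
    noise is added so repeated calls yield different candidates.
    """
    out = []
    for x in latent:
        noise = rng.gauss(0.0, noise_scale) if rng is not None else 0.0
        out.append(0.9 * x + noise)
    return out


def lookahead_estimate(latent, steps_left, n_lookahead):
    """Run up to n_lookahead deterministic steps to get a cleaner estimate."""
    estimate = latent
    for _ in range(min(n_lookahead, steps_left)):
        estimate = toy_denoise_step(estimate)
    return estimate


def calibrated_reward(latent, weights):
    """Weighted sum of two toy metrics (both are assumptions for this sketch)."""
    diffs = [abs(b - a) for a, b in zip(latent, latent[1:])]
    smoothness = -sum(diffs) / len(diffs)  # penalizes jumps between frames
    mean = sum(latent) / len(latent)
    dynamics = sum((x - mean) ** 2 for x in latent) / len(latent)  # penalizes static clips
    return weights[0] * smoothness + weights[1] * dynamics


def latent_beam_search(n_steps=20, n_beams=2, n_candidates=4, n_lookahead=3,
                       weights=(0.7, 0.3), n_frames=8, seed=0):
    """Keep the n_beams best latents at every denoising step."""
    rng = random.Random(seed)
    beams = [[rng.gauss(0.0, 1.0) for _ in range(n_frames)]
             for _ in range(n_beams)]
    for step in range(n_steps):
        steps_left = n_steps - step - 1
        # Expand: each beam proposes several stochastic next latents.
        candidates = [toy_denoise_step(beam, rng, noise_scale=0.1)
                      for beam in beams for _ in range(n_candidates)]
        # Score each candidate on its lookahead estimate, not the noisy latent.
        scored = [(calibrated_reward(
                       lookahead_estimate(c, steps_left, n_lookahead), weights), c)
                  for c in candidates]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        beams = [c for _, c in scored[:n_beams]]
    final = [(calibrated_reward(b, weights), b) for b in beams]
    best_score, best_latent = max(final, key=lambda pair: pair[0])
    return best_latent, best_score
```

In this sketch, `n_beams * n_candidates` plays the role of the search budget and `n_lookahead` controls how far each candidate is denoised before scoring. Setting `n_beams=1` keeps only the single best candidate at each step, which corresponds to a greedy search in this toy setting.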