Text-to-video generation is expensive, so only a few samples are typically produced per prompt. In this low-sample regime, maximizing the value of each batch requires high cross-video diversity. Recent methods improve diversity for image generation, but for videos they often degrade within-video temporal consistency and require costly backpropagation through a video decoder. We propose a joint-sampling framework for flow-matching video generators that improves batch diversity while preserving temporal consistency. Our approach applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective. To avoid image-space gradients, we compute both objectives with lightweight latent-space models, avoiding video decoding and decoder backpropagation. Experiments on a state-of-the-art text-to-video flow-matching model show diversity comparable to strong joint-sampling baselines while substantially improving temporal consistency and color naturalness. Code will be released.
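The step "applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective" reads like a gradient projection. Below is a minimal sketch under that assumption; the function name and the choice of Euclidean projection are hypothetical illustrations, not the paper's actual implementation:

```python
import numpy as np

def project_update(g_div: np.ndarray, g_tc: np.ndarray) -> np.ndarray:
    """Hypothetical sketch: filter a diversity-driven update so it does not
    decrease a temporal-consistency objective.

    g_div: update direction that increases cross-video diversity.
    g_tc:  gradient (ascent direction) of the temporal-consistency objective,
           computed by a lightweight latent-space model in the paper's setup.
    """
    alignment = np.dot(g_div, g_tc)
    if alignment >= 0:
        # Update already does not decrease consistency; keep it unchanged.
        return g_div
    # Subtract only the conflicting component along g_tc, so the
    # remaining update is orthogonal to the consistency gradient.
    return g_div - (alignment / np.dot(g_tc, g_tc)) * g_tc
```

For example, with `g_div = [1, -1]` and `g_tc = [0, 1]`, the update conflicts with consistency; the projected update `[1, 0]` keeps the diversity-increasing component while no longer pushing consistency down.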