Many video workflows benefit from a mixture of user controls with varying granularity, from exact 4D object trajectories and camera paths to coarse text prompts, whereas existing video generative models are typically trained for fixed input formats. We develop Ctrl-VI, a video synthesis method that addresses this need and generates samples with high controllability for specified elements while maintaining diversity for under-specified ones. We cast the task as variational inference to approximate a composed distribution, leveraging multiple video generation backbones to account for all task constraints collectively. To address the resulting optimization challenge, we break the problem down into step-wise KL divergence minimization over an annealed sequence of distributions, and further propose a context-conditioned factorization technique that reduces the number of modes in the solution space to circumvent local optima. Experiments suggest that our method produces samples with improved controllability, diversity, and 3D consistency compared to prior work.
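As a hedged reading of the formulation summarized above (the notation below, including the control variables $c_k$, the per-constraint conditionals $p_k$, and the annealed targets $\pi_t$, is our own and not drawn from the paper), the composed distribution and its step-wise variational approximation might be sketched as follows.

% Sketch only: symbols (p_k, c_k, pi_t, q_t) are illustrative, not the paper's notation.
\begin{align*}
  % Composed target over videos x, combining all K task constraints
  p^{*}(x) &\;\propto\; \prod_{k=1}^{K} p_k(x \mid c_k), \\
  % Global variational objective
  q^{*} &= \arg\min_{q} \, \mathrm{KL}\!\left( q(x) \,\|\, p^{*}(x) \right), \\
  % Annealed sequence pi_T, \dots, pi_0 with pi_0 = p^{*}; one KL minimization per step
  q_t &= \arg\min_{q} \, \mathrm{KL}\!\left( q(x_t) \,\|\, \pi_t(x_t) \right),
  \qquad t = T, \dots, 0 .
\end{align*}

Under this reading, each $\pi_t$ interpolates from a tractable base distribution at $t=T$ toward the composed target $p^{*}$ at $t=0$, so each step only requires a local KL minimization rather than matching $p^{*}$ directly.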