MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling

Diffusion-based video generation has made significant progress, yet generating multiple actions that occur sequentially remains difficult. Generating such a video directly is challenging because fine-grained action annotations are scarce, temporal semantic correspondences are hard to establish, and long-term consistency is hard to maintain. To tackle this, we propose a straightforward solution: splicing multiple single-action video segments sequentially. The core challenge is generating smooth and natural transitions between these segments, given the inherent complexity and variability of action transitions. We introduce MAVIN (Multi-Action Video INfilling model), which generates transition videos that seamlessly connect two given videos into a cohesive sequence. MAVIN incorporates several techniques to address the challenges of transition video infilling. First, a consecutive noising strategy coupled with variable-length sampling handles large infilling gaps and varied generation lengths. Second, boundary frame guidance (BFG) addresses the lack of semantic guidance during transition generation. Third, a Gaussian filter mixer (GFM) dynamically manages noise initialization during inference, mitigating train-test discrepancy while preserving generation flexibility. Additionally, we introduce a new metric, CLIP-RS (CLIP Relative Smoothness), to evaluate temporal coherence and smoothness, complementing traditional quality-based metrics. Experimental results on horse and tiger scenarios show that MAVIN generates smoother and more coherent video transitions than existing methods.
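
The abstract does not spell out how the consecutive noising strategy and variable-length sampling are implemented. As a rough illustration only, the sketch below shows one plausible reading: pick a contiguous block of interior frames of random length, keep the frames on both sides clean as context, and apply the standard diffusion forward-noising step only to the selected block. The function names, the gap bounds, and the single `alpha_bar` value are assumptions made for this example and are not taken from MAVIN.

```python
import numpy as np


def sample_infill_mask(num_frames, min_gap, max_gap, rng):
    """Hypothetical helper: mark one contiguous block of frames to infill.

    At least one clean boundary frame is kept on each side of the block.
    """
    max_gap = min(max_gap, num_frames - 2)
    if min_gap < 1 or min_gap > max_gap:
        raise ValueError("gap bounds do not fit inside the clip")
    # Variable-length sampling: the gap length is drawn uniformly per clip.
    gap = int(rng.integers(min_gap, max_gap + 1))
    # The upper bound is exclusive, so the block always ends before the last frame.
    start = int(rng.integers(1, num_frames - gap))
    mask = np.zeros(num_frames, dtype=bool)
    mask[start:start + gap] = True
    return mask


def noise_masked_frames(video, mask, alpha_bar, rng):
    """Forward-noise only the masked frames: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps.

    `video` has shape (frames, height, width, channels); unmasked frames are
    returned unchanged so they can serve as clean context.
    """
    eps = rng.standard_normal(video.shape)
    noisy = np.sqrt(alpha_bar) * video + np.sqrt(1.0 - alpha_bar) * eps
    out = video.copy()
    out[mask] = noisy[mask]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.standard_normal((16, 8, 8, 3))  # stand-in for a 16-frame clip
    mask = sample_infill_mask(16, min_gap=4, max_gap=10, rng=rng)
    noised = noise_masked_frames(clip, mask, alpha_bar=0.5, rng=rng)
    print("frames to infill:", np.flatnonzero(mask).tolist())
```

In this reading, a denoising model would be trained to recover the masked block given the clean frames on either side. How MAVIN actually schedules noise across frames, and how BFG and GFM interact with it, cannot be inferred from the abstract alone.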
