Text-to-video diffusion models have made remarkable advances. Driven by their ability to generate temporally coherent videos, research on zero-shot video editing built on these foundation models has expanded rapidly. To enhance editing quality, structural controls are frequently employed, and among these techniques cross-attention mask control stands out for its effectiveness and efficiency. However, when cross-attention masks are applied naively to video editing, they can introduce artifacts such as blurring and flickering. Our experiments uncover a critical factor overlooked in prior video editing research: cross-attention masks are not consistently sharp but vary with model structure and denoising timestep. To address this issue, we propose Mask Matching Cost (MMC), a metric that quantifies this variability, and FreeMask, a method that selects the optimal masks for a given video editing task. Using MMC-selected masks, we further improve the masked fusion mechanism across the full set of attention features, i.e., the temporal, cross-, and self-attention modules. Our approach integrates seamlessly into existing zero-shot video editing frameworks and improves their performance, requiring no control assistance or parameter fine-tuning while enabling adaptive decoupling of unedited semantic layouts through mask precision control. Extensive experiments demonstrate that FreeMask achieves superior semantic fidelity, temporal consistency, and editing quality compared with state-of-the-art methods.
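To make the core mechanism concrete, below is a minimal, hypothetical sketch of the two operations the abstract refers to: binarizing a cross-attention map for an edited prompt token into a spatial mask, and using that mask to fuse edited and source features. The threshold-based binarization here is an illustrative simplification; the paper instead selects among candidate masks (which vary across layers and denoising timesteps) via its MMC metric.

```python
def cross_attention_mask(attn, token_idx, threshold=0.5):
    """Binarize a cross-attention map for one prompt token.

    attn: attn[p][t] is the attention probability that spatial
          position p attends to prompt token t.
    token_idx: index of the edited word in the prompt.
    threshold: fraction of the map's peak response (hypothetical
               choice; FreeMask selects masks via MMC instead).
    """
    scores = [row[token_idx] for row in attn]
    peak = max(scores) or 1.0
    # Normalize by the peak response, then binarize.
    return [1.0 if s / peak > threshold else 0.0 for s in scores]


def masked_fusion(src_feat, edit_feat, mask):
    """Keep edited features inside the mask, source features outside,
    so the unedited semantic layout is preserved."""
    return [m * e + (1.0 - m) * s
            for m, e, s in zip(mask, edit_feat, src_feat)]
```

In the actual pipeline this fusion would be applied per frame inside the temporal, cross-, and self-attention modules, with the mask resolution matched to each feature map.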