We propose MLV-Edit, a training-free, flow-based framework that addresses the unique challenges of minute-level video editing. Existing techniques excel at short-form video manipulation, but scaling them to long-duration videos remains difficult because of prohibitive computational overhead and the difficulty of maintaining global temporal consistency across thousands of frames. To address these issues, MLV-Edit adopts a divide-and-conquer strategy that edits the video segment by segment, supported by two core modules. Velocity Blend rectifies motion inconsistencies at segment boundaries by aligning the flow fields of adjacent chunks, thereby eliminating the flickering and boundary artifacts commonly observed in fragmented video processing. Attention Sink anchors local segment features to global reference frames, effectively suppressing cumulative structural drift. Extensive quantitative and qualitative experiments demonstrate that MLV-Edit consistently outperforms state-of-the-art methods in temporal stability and semantic fidelity.
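To make the segment-wise idea concrete, the following is a minimal sketch of one way overlapping segments could be reconciled at their boundaries. It is not the paper's Velocity Blend formulation, which the abstract does not specify. It assumes that each segment yields a per-frame velocity prediction and that adjacent segments share a fixed number of overlapping frames, which are combined with a plain linear cross-fade. The function names `split_segments` and `blend_segment_velocities`, and the parameters `seg_len` and `overlap`, are hypothetical and chosen only for illustration.

```python
import numpy as np


def split_segments(num_frames, seg_len, overlap):
    """Return (start, end) frame ranges that cover the video with a fixed overlap.

    Hypothetical helper: the segmentation scheme is an assumption.
    """
    if seg_len <= 0 or overlap < 0 or 2 * overlap > seg_len:
        raise ValueError("need seg_len > 0 and 0 <= 2 * overlap <= seg_len")
    stride = seg_len - overlap
    segments, start = [], 0
    while True:
        end = min(start + seg_len, num_frames)
        segments.append((start, end))
        if end == num_frames:
            return segments
        start += stride


def blend_segment_velocities(velocities, segments, num_frames, overlap):
    """Linearly cross-fade per-segment velocities over the shared frames.

    velocities[i] has shape (end_i - start_i, ...) for segments[i].
    This is a generic cross-fade, used here as a stand-in for boundary alignment.
    """
    feat_shape = velocities[0].shape[1:]
    extra_dims = (1,) * len(feat_shape)
    blended = np.zeros((num_frames, *feat_shape))
    weight_sum = np.zeros(num_frames)
    # Ramp values lie strictly between 0 and 1, so every frame keeps a positive weight.
    ramp = np.linspace(0.0, 1.0, overlap + 2)[1:-1]
    for v, (start, end) in zip(velocities, segments):
        n = end - start
        w = np.ones(n)
        if start > 0:            # fade in over frames shared with the previous segment
            w[:overlap] = ramp
        if end < num_frames:     # fade out over frames shared with the next segment
            w[n - overlap:] = ramp[::-1]
        blended[start:end] += w.reshape((n,) + extra_dims) * v
        weight_sum[start:end] += w
    # Normalize so the weights at every frame sum to one.
    return blended / weight_sum.reshape((num_frames,) + extra_dims)


# Usage: two segments that disagree (all zeros vs. all ones) are blended smoothly.
segs = split_segments(num_frames=10, seg_len=6, overlap=2)
vels = [np.zeros((6, 3)), np.ones((6, 3))]
out = blend_segment_velocities(vels, segs, num_frames=10, overlap=2)
```

In this sketch the frames outside the overlap keep their own segment's prediction unchanged, while the shared frames move gradually from one segment's prediction to the next, which is the kind of boundary smoothing the abstract attributes to aligning adjacent chunks.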