Although existing video editing methods are generally feasible, they often require many costly iterations and still struggle to deliver high-quality, satisfying editing results. We attribute this limitation to the prevalent data-to-data paradigm, which is less compatible with modern generative models than noise-to-data generation. To address this gap, we revisit video editing from a noise-to-data perspective and propose Streaming-Generation-based Video Editing (StreamGVE), which preserves few-step sampling while seamlessly injecting source-video conditions. Built on pre-trained streaming generation models, StreamGVE introduces dual-branch fast sampling with a self-attention bridge and cross-attention grounding/boosting to satisfy both sampling and conditioning requirements. We further propose source-oriented guidance to improve target-generation quality and a visual prompting strategy to enhance editing flexibility and practicality. The method is effective, robust, and generalizable across different models. Extensive experiments on diverse video editing tasks show that StreamGVE consistently outperforms existing approaches, even in few-step settings with minimal time cost.