Recent advances in video generation have outpaced progress in video editing, which remains constrained by several factors: (a) the task's dependency on supervision severely limits generality, (b) an unnecessary artificial separation between the generation and editing tasks, and (c) the high computational cost of training a video model. In this work, we propose UES (Unlocking Universal Editing via Self-Supervision), a lightweight self-supervised fine-tuning strategy that transforms generation models into unified generation-editing systems through self-supervised semantic alignment. Our approach establishes a dual-conditioning mechanism in which original video-text pairs jointly provide visual and textual semantics, enabling structured learning of intrinsic spatiotemporal correspondences. Key advantages include: (i) Universality through supervision-free adaptation to diverse editing tasks, (ii) Unification of generation and editing applicable to most text(+image)-to-video models, and (iii) Efficiency via lightweight fine-tuning that reduces tunable parameters by 92.67%. To enable systematic evaluation, we introduce OmniBench-99, a comprehensive benchmark spanning 99 videos across humans/animals, environments, and objects, comprising 4 editing types and 8 scenarios. Extensive experiments show UES enables models without inherent editing capability to perform powerful and universal editing while preserving or even enhancing their original generation performance.
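As a concrete illustration of the adapter-only, dual-conditioning fine-tuning described above, the minimal sketch below freezes a stand-in generation backbone and trains only a small adapter that fuses visual features of the original video with textual features of its caption under a self-supervised denoising objective. All module names, shapes, and the simplified diffusion step are illustrative assumptions, not the UES implementation.

```python
# Minimal sketch, assuming a latent-diffusion-style text(+image)-to-video backbone.
# Every class, shape, and hyperparameter here is hypothetical and chosen for brevity.

import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in for a pretrained video denoiser (kept frozen during fine-tuning)."""
    def __init__(self, dim=64):
        super().__init__()
        self.denoise = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, noisy_latents, cond):
        # Predicts the noise added to the video latents, given the fused condition.
        return self.denoise(noisy_latents + cond)

class DualCondAdapter(nn.Module):
    """Lightweight adapter fusing visual and textual semantics (the only tuned part)."""
    def __init__(self, vis_dim=64, txt_dim=64, dim=64):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, vis_feat, txt_feat):
        fused = torch.cat([self.vis_proj(vis_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.fuse(fused)

backbone = TinyBackbone()
adapter = DualCondAdapter()
for p in backbone.parameters():          # freeze the generator; only the adapter is trained,
    p.requires_grad_(False)              # which is where the tunable-parameter reduction comes from
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# One self-supervised step: reconstruct the original video latents from noise,
# conditioned jointly on that same video's features and its caption's features.
vid_latents = torch.randn(8, 64)         # placeholder for encoded video frames
vis_feat    = torch.randn(8, 64)         # placeholder visual-encoder features of the same video
txt_feat    = torch.randn(8, 64)         # placeholder text-encoder features of its caption

noise = torch.randn_like(vid_latents)
noisy = vid_latents + noise              # simplified forward-diffusion step
cond = adapter(vis_feat, txt_feat)
pred_noise = backbone(noisy, cond)
loss = nn.functional.mse_loss(pred_noise, noise)
loss.backward()
opt.step()
opt.zero_grad()
```

At inference time, one would swap the caption (or a reference image) for an editing instruction while keeping the same visual conditioning path, which is how a generation-only model can be repurposed for editing without task-specific supervision.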