We introduce VIRES, a video instance repainting method with sketch and text guidance that supports instance repainting, replacement, generation, and removal. Existing approaches struggle with temporal consistency and accurate alignment with the provided sketch sequence. VIRES leverages the generative priors of text-to-video models to maintain temporal consistency and produce visually pleasing results. We propose the Sequential ControlNet with standardized self-scaling, which effectively extracts structure layouts and adaptively captures high-contrast sketch details. We further augment the diffusion transformer backbone with sketch attention to interpret and inject fine-grained sketch semantics, and a sketch-aware encoder ensures that repainted results align with the provided sketch sequence. Additionally, we contribute VireSet, a dataset with detailed annotations tailored for training and evaluating video instance editing methods. Experimental results demonstrate the effectiveness of VIRES, which outperforms state-of-the-art methods in visual quality, temporal consistency, condition alignment, and human ratings. Project page: https://suimuc.github.io/suimu.github.io/projects/VIRES/