Recent video generative models primarily rely on carefully written text prompts for specific tasks, such as inpainting or style editing. They require labor-intensive textual descriptions of input videos, which hinders their flexibility in adapting personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities, such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video scenes in well-structured natural language, capturing both the holistic context and focused object details. Subsequently, in the P2V stage, users can optionally refine these descriptions to guide the video diffusion model, enabling various modifications to the input video, such as removing or changing subjects and/or adding new objects. The proposed approach stands out from other methods through several significant contributions: (1) RACCooN introduces a multi-granular spatiotemporal pooling strategy to generate well-structured video descriptions that capture both the broad context and object details without requiring complex human annotations, simplifying precise, text-based video content editing for users. (2) Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. (3) RACCooN can also plan the addition of new objects to a given video, so users simply prompt the model to receive a detailed editing plan for complex video edits. The proposed framework demonstrates impressive and versatile capabilities in video-to-paragraph generation and video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
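To give intuition for how pooling at multiple granularities can capture both holistic context and localized detail, the sketch below averages video features over time and then over spatial grids of increasing resolution. This is an illustrative assumption, not RACCooN's actual V2P implementation: the grid sizes, the feature shape `(T, H, W, C)`, and the function `multigranular_pool` are all hypothetical.

```python
import numpy as np

def multigranular_pool(feats, grids=((1, 1), (2, 2), (4, 4))):
    """Pool video features of shape (T, H, W, C) at several spatial
    granularities: a 1x1 grid gives one holistic token, while finer
    grids give tokens for localized regions. (Hypothetical sketch.)"""
    T, H, W, C = feats.shape
    clip = feats.mean(axis=0)  # average over time -> (H, W, C)
    tokens = []
    for gh, gw in grids:
        for i in range(gh):
            for j in range(gw):
                # average-pool one grid cell into a single C-dim token
                cell = clip[i * H // gh:(i + 1) * H // gh,
                            j * W // gw:(j + 1) * W // gw]
                tokens.append(cell.mean(axis=(0, 1)))
    return np.stack(tokens)  # (sum of gh*gw over grids, C)

video = np.random.rand(8, 16, 16, 64).astype(np.float32)
tokens = multigranular_pool(video)
print(tokens.shape)  # (21, 64): 1 holistic + 4 coarse + 16 fine tokens
```

In a V2P-style pipeline, such pooled tokens would be fed to a language model so that the coarse tokens ground the scene-level description while the fine tokens ground object-level details.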