Recent video generative models primarily rely on carefully written text prompts for specific tasks, such as inpainting or style editing. They require labor-intensive textual descriptions of input videos, which hinders their flexibility in adapting personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities, such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video scenes in well-structured natural language, capturing both the holistic context and focused object details. Subsequently, in the P2V stage, users can optionally refine these descriptions to guide the video diffusion model, enabling various modifications to the input video, such as removing or changing subjects and/or adding new objects. The proposed approach stands out from other methods through several significant contributions: (1) RACCooN introduces a multi-granular spatiotemporal pooling strategy to generate well-structured video descriptions that capture both the broad context and object details without requiring complex human annotations, simplifying precise, text-based video content editing for users. (2) Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. (3) RACCooN can also plan the addition of new objects to a given video, so users simply prompt the model to receive a detailed editing plan for complex video edits. The proposed framework demonstrates impressive and versatile capabilities in video-to-paragraph generation and video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
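To give intuition for how pooling at multiple granularities can capture both holistic context and localized detail, the sketch below averages video features over time and then over spatial grids of increasing resolution. This is an illustrative assumption, not RACCooN's actual V2P implementation: the grid sizes, the feature shape `(T, H, W, C)`, and the function `multigranular_pool` are all hypothetical.

```python
import numpy as np

def multigranular_pool(feats, grids=((1, 1), (2, 2), (4, 4))):
    """Pool video features of shape (T, H, W, C) at several spatial
    granularities: a 1x1 grid gives one holistic token, while finer
    grids give tokens for localized regions. (Hypothetical sketch.)"""
    T, H, W, C = feats.shape
    clip = feats.mean(axis=0)  # average over time -> (H, W, C)
    tokens = []
    for gh, gw in grids:
        for i in range(gh):
            for j in range(gw):
                # average-pool one grid cell into a single C-dim token
                cell = clip[i * H // gh:(i + 1) * H // gh,
                            j * W // gw:(j + 1) * W // gw]
                tokens.append(cell.mean(axis=(0, 1)))
    return np.stack(tokens)  # (sum of gh*gw over grids, C)

video = np.random.rand(8, 16, 16, 64).astype(np.float32)
tokens = multigranular_pool(video)
print(tokens.shape)  # (21, 64): 1 holistic + 4 coarse + 16 fine tokens
```

In a V2P-style pipeline, such pooled tokens would be fed to a language model so that the coarse tokens ground the scene-level description while the fine tokens ground object-level details.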