This paper focuses on understanding the predominant video creation pipeline, i.e., compositional video editing with six main types of editing components: video effects, animation, transition, filter, sticker, and text. In contrast to existing visual representation learning, which targets raw visual materials (i.e., images/videos), we aim to learn visual representations of the editing actions/components that are applied to such raw materials. We start by proposing the first large-scale dataset of editing components for video creation, which covers $3,094$ editing components with $618,800$ videos. Each video in our dataset is rendered from various image/video materials with a single editing component, which supports atomic visual understanding of individual editing components. The dataset can also benefit several downstream tasks, e.g., editing component recommendation and editing component recognition/retrieval. Existing visual representation methods perform poorly on this task because it is difficult to disentangle the visual appearance of editing components from the raw materials. To this end, we benchmark popular alternative solutions and propose a novel method that learns to attend to the appearance of editing components regardless of the raw materials. Our method achieves favorable results on editing component retrieval/recognition compared to the alternative solutions. A user study further shows that our representations cluster visually similar editing components better than the alternatives. Moreover, applying our learned representations to the transition recommendation task achieves state-of-the-art results on the AutoTransition dataset. The code and dataset are available at https://github.com/GX77/Edit3K.
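To make the disentanglement idea concrete, below is a minimal, hypothetical sketch of one way such representations could be trained contrastively; it is not the paper's actual architecture. Since each dataset video applies a single editing component to varied raw materials, two videos sharing a component but differing in material form a natural positive pair, pushing the encoder to attend to the component rather than the material. The `VideoEncoder`, the InfoNCE objective, and the batch layout here are all illustrative assumptions.

```python
# Hypothetical sketch: contrastive learning of editing-component embeddings.
# Positive pairs = two videos rendered with the SAME editing component on
# DIFFERENT raw materials, so the encoder must ignore material appearance.
# VideoEncoder and the data layout are assumptions, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    """Toy stand-in: mean-pool frame features, then project to an embedding."""
    def __init__(self, frame_dim=512, embed_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(frame_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, frames):               # frames: (B, T, frame_dim)
        return F.normalize(self.proj(frames.mean(dim=1)), dim=-1)

def info_nce(z_a, z_b, temperature=0.07):
    """InfoNCE: z_a[i] and z_b[i] share an editing component; all other
    pairs in the batch (different components) serve as negatives."""
    logits = z_a @ z_b.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

# Usage on random stand-in data: view_a / view_b hold the same 32 editing
# components, each rendered on different raw materials.
encoder = VideoEncoder()
view_a = torch.randn(32, 16, 512)             # 32 components, 16 frames each
view_b = torch.randn(32, 16, 512)
loss = info_nce(encoder(view_a), encoder(view_b))
loss.backward()
```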