A video storyboard is a roadmap for video creation that consists of shot-by-shot images visualizing the key plots of a text synopsis. Creating video storyboards remains challenging, however: it requires not only cross-modal association between high-level texts and images but also long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS), which aims to retrieve an ordered sequence of images as the video storyboard that visualizes the text synopsis. We construct the MovieNet-TeViS dataset based on the public MovieNet dataset. It contains 10K text synopses, each paired with keyframes manually selected from the corresponding movies by considering both relevance and cinematic coherence. To benchmark the task, we present strong CLIP-based baselines and a novel model, VQ-Trans. VQ-Trans first encodes the text synopsis and images into a joint embedding space and uses vector quantization (VQ) to improve the visual representation. It then auto-regressively generates a sequence of visual features for retrieval and ordering. Experimental results demonstrate that VQ-Trans significantly outperforms prior methods and the CLIP-based baselines. Nevertheless, a large gap to human performance remains, suggesting ample room for future work. The code and data are available at: \url{https://ruc-aimind.github.io/projects/TeViS/}
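To make the two steps named above more concrete, the following is a minimal illustrative sketch, not the authors' VQ-Trans implementation. It assumes that features are plain Python lists, that vector quantization means snapping a feature to its nearest codebook entry, and that retrieval and ordering can be approximated by greedily matching each predicted feature, in sequence, to the most similar unused candidate keyframe. The function names `quantize` and `retrieve_in_order`, the toy vectors, and the greedy matching rule are all hypothetical choices made for illustration; the predicted features stand in for whatever an auto-regressive decoder would output.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def quantize(vec, codebook):
    """Return the index of the codebook entry nearest to vec
    (squared Euclidean distance). Assumed form of the VQ lookup."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )


def retrieve_in_order(predicted, candidates):
    """Greedy stand-in for retrieval and ordering: for each predicted
    feature, in sequence, pick the most similar candidate not yet used."""
    remaining = set(range(len(candidates)))
    order = []
    for feature in predicted:
        if not remaining:
            break
        best = max(sorted(remaining), key=lambda i: cosine(feature, candidates[i]))
        order.append(best)
        remaining.remove(best)
    return order


# Toy data (hypothetical): a 2-entry codebook and three candidate keyframes.
codebook = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
predicted = [[1.0, 0.1], [0.9, 1.0], [0.1, 1.0]]

code_index = quantize([0.9, 0.2], codebook)
storyboard = retrieve_in_order(predicted, candidates)
```

In this toy setup, `storyboard` holds candidate indices in the order the predicted sequence selects them, so the output is both a retrieval result and an ordering. A real system would operate on learned embeddings rather than hand-written vectors.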