A video storyboard is a roadmap for video creation: a sequence of shot-by-shot images that visualizes the key plot points of a text synopsis. Creating video storyboards remains challenging, however. It requires cross-modal association between high-level text and images, and it also demands long-term reasoning so that transitions across shots are smooth. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS), which aims to retrieve an ordered sequence of images that serves as the video storyboard for a text synopsis. We construct the MovieNet-TeViS dataset based on the public MovieNet dataset. It contains 10K text synopses, each paired with keyframes manually selected from the corresponding movie by considering both relevance and cinematic coherence. To benchmark the task, we present strong CLIP-based baselines and a novel model, VQ-Trans. VQ-Trans first encodes the text synopsis and images into a joint embedding space and uses vector quantization (VQ) to improve the visual representation. It then auto-regressively generates a sequence of visual features for retrieval and ordering. Experimental results demonstrate that VQ-Trans significantly outperforms prior methods and the CLIP-based baselines. Nevertheless, a large gap to human performance remains, which leaves substantial room for future work. The code and data are available at: \url{https://ruc-aimind.github.io/projects/TeViS/}
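The two generic operations named in the VQ-Trans description, vector quantization and feature-based retrieval in order, can be illustrated with a minimal NumPy sketch. This is not the VQ-Trans implementation: the function names `quantize` and `retrieve_in_order`, the Euclidean nearest-code lookup, the cosine-similarity scoring, and the rule that each keyframe is used at most once are all assumptions made for illustration, and the encoders and the auto-regressive model that would produce the features are left out entirely.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def quantize(features, codebook):
    """Map each feature to its nearest codebook entry (hypothetical helper).

    features: (n, d) array, codebook: (k, d) array.
    Returns the quantized features (n, d) and the chosen code indices (n,).
    """
    # Pairwise squared Euclidean distances, shape (n, k).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx


def retrieve_in_order(predicted, candidates):
    """Pick one candidate image per predicted feature, preserving order.

    Assumption for this sketch: greedy selection by cosine similarity,
    with each candidate keyframe used at most once.
    """
    sims = l2_normalize(predicted) @ l2_normalize(candidates).T
    chosen = []
    for row in sims:
        row = row.copy()
        if chosen:
            row[chosen] = -np.inf  # mask keyframes that were already picked
        chosen.append(int(row.argmax()))
    return chosen
```

In a full system, `predicted` would be the sequence of visual features generated from the synopsis and `candidates` the encoded keyframes; the returned index list would then be the storyboard order.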