A storyboard is a roadmap for video creation: a sequence of shot-by-shot images that visualizes the key plot points of a text synopsis. Creating video storyboards remains challenging, however, because it requires not only associating high-level text with images but also long-term reasoning to make transitions between shots smooth. In this paper, we propose a new task, Text synopsis to Video Storyboard (TeViS), which aims to retrieve an ordered sequence of images that visualizes a text synopsis. We construct the MovieNet-TeViS benchmark from the public MovieNet dataset. It contains 10K text synopses, each paired with keyframes manually selected from the corresponding movie for both relevance and cinematic coherence. We also present an encoder-decoder baseline for the task. The model uses a pretrained vision-and-language model to improve high-level text-image matching. To improve coherence across long sequences of shots, we further propose pretraining the decoder on large-scale movie frames without text. Experimental results demonstrate that our proposed model significantly outperforms other models at creating text-relevant and coherent storyboards. Nevertheless, a large gap to human performance remains, leaving room for promising future work.
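To make the task format concrete, the following is a minimal sketch of ordered retrieval, not the encoder-decoder baseline described above. It assumes that the synopsis and each candidate keyframe have already been embedded into a shared vector space by some vision-and-language model, and that all embeddings are non-zero. The function names, the greedy selection rule, and the `coherence_weight` parameter are illustrative assumptions that do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_storyboard(text_emb, image_embs, num_shots, coherence_weight=0.5):
    """Return indices of `num_shots` distinct candidate images, in shot order.

    Hypothetical scoring rule (an assumption, not the paper's model):
    each step picks the unused image with the highest
    similarity-to-text + coherence_weight * similarity-to-previous-shot.
    """
    if num_shots > len(image_embs):
        raise ValueError("num_shots exceeds the number of candidate images")

    # Text relevance is fixed per candidate, so compute it once.
    relevance = [cosine(text_emb, img) for img in image_embs]

    chosen = []
    for _ in range(num_shots):
        best_idx, best_score = None, -math.inf
        for idx, img in enumerate(image_embs):
            if idx in chosen:
                continue  # each image may appear only once
            score = relevance[idx]
            if chosen:
                # Reward visual similarity to the previously selected shot.
                score += coherence_weight * cosine(image_embs[chosen[-1]], img)
            if score > best_score:
                best_idx, best_score = idx, score
        chosen.append(best_idx)
    return chosen


# Toy 2-D embeddings standing in for real model outputs.
synopsis = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(greedy_storyboard(synopsis, candidates, num_shots=3))
```

The greedy rule only looks one shot back, so it illustrates the trade-off between text relevance and shot-to-shot coherence rather than the long-term reasoning the task calls for.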