Visual storytelling is the task of generating a multi-sentence story from a set of images. One of its most challenging aspects is appropriately incorporating the visual variation and contextual information captured in the input images. As a result, stories generated from a set of images often lack cohesiveness, relevance, and semantic relationships. In this paper, we propose a novel Vision Transformer-based model for describing a set of images as a story. The proposed method extracts distinct features from the input images using a Vision Transformer (ViT). First, each input image is divided into 16×16 patches, which are flattened and passed through a linear projection. This transformation from a single image into multiple image patches captures the visual variety of the input visual patterns. The resulting features are fed to a bidirectional LSTM, which is part of the sequence encoder and captures the past and future image context of all image patches. An attention mechanism is then applied to increase the discriminatory capacity of the data fed into the language model, a Mogrifier-LSTM. We evaluate the proposed model on the Visual Storytelling dataset (VIST), and the results show that it outperforms current state-of-the-art models.
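
The patch step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it reads "16×16 patches" as non-overlapping tiles of 16×16 pixels, and the 224×224 input size, the embedding width of 64, the random projection matrix, and the function names `patchify` and `embed_patches` are all assumptions made for illustration. A trained ViT would learn the projection and would also add position embeddings, which are omitted here.

```python
import numpy as np


def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patch x patch tiles."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    tiles = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major patch order.
    return tiles.reshape(rows * cols, patch * patch * c)


def embed_patches(image, weight, patch=16):
    """Linearly project each flattened patch with the given weight matrix."""
    return patchify(image, patch) @ weight


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # hypothetical input size
embed_dim = 64                             # hypothetical embedding width
# Random stand-in for a learned projection matrix.
weight = rng.normal(size=(16 * 16 * 3, embed_dim)) * 0.02

tokens = embed_patches(image, weight)
print(tokens.shape)  # (196, 64): 14 x 14 patches, each projected to 64 dimensions
```

Under these assumptions, each image yields a sequence of 196 patch embeddings. In a pipeline like the one described, features derived from such sequences would then be passed to the sequence encoder.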