Characters are essential to the plot of any story. Establishing the characters before writing a story can improve the clarity of the plot and the overall flow of the narrative. However, previous work on visual storytelling tends to focus on detecting objects in images and discovering relationships between them. Under this approach, characters are not distinguished from other objects when they are fed into the generation pipeline, so the result is a coherent sequence of events rather than a character-centric story. To address this limitation, we introduce the VIST-Character dataset, which provides rich character-centric annotations, including visual and textual co-reference chains and importance ratings for characters. Based on this dataset, we propose two new tasks: important character detection and character grounding in visual stories. For both tasks, we develop simple, unsupervised models based on distributional similarity and pre-trained vision-and-language models. Our new dataset, together with these models, can serve as the foundation for subsequent work on analysing and generating stories from a character-centric perspective.