As a cross-modal task, visual storytelling aims to automatically generate a story for an ordered image sequence. Unlike image captioning, visual storytelling requires not only modeling the relationships between objects within an image but also mining the connections between adjacent images. Recent approaches primarily use either end-to-end or multi-stage frameworks to generate relevant stories, but they usually overlook latent topic information. In this paper, to generate more coherent and relevant stories, we propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST). In particular, we pre-extract the topic information of stories from both visual and linguistic perspectives. We then apply two topic-consistent reinforcement learning rewards that identify the discrepancy between the generated story and the human-labeled story, and use them to refine the whole generation process. Extensive experiments on the VIST dataset, together with human evaluation, demonstrate that our proposed model outperforms most competitive models across multiple evaluation metrics.
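To make the idea of a topic-consistent reward more concrete, the following is a minimal sketch of one plausible reading, not the formulation used in TARN-VIST. It assumes that topics are represented as fixed-length distributions, that each reward is the cosine similarity between the generated story's topic distribution and a pre-extracted visual or linguistic topic distribution, and that the two rewards are mixed with a weight before entering a REINFORCE-style loss with a baseline. The names `cosine_similarity`, `topic_reward`, `reinforce_loss`, and the `weight` parameter are all hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def topic_reward(
    generated_topics: Sequence[float],
    visual_topics: Sequence[float],
    linguistic_topics: Sequence[float],
    weight: float = 0.5,
) -> float:
    """Hypothetical combined reward (an assumption, not the paper's definition).

    Mixes a visual topic-consistency score and a linguistic topic-consistency
    score, each computed as a cosine similarity against the generated story's
    topic distribution.
    """
    r_visual = cosine_similarity(generated_topics, visual_topics)
    r_linguistic = cosine_similarity(generated_topics, linguistic_topics)
    return weight * r_visual + (1.0 - weight) * r_linguistic


def reinforce_loss(log_probs: Sequence[float], reward: float, baseline: float) -> float:
    """REINFORCE-style loss with a baseline: -(reward - baseline) * sum(log p)."""
    advantage = reward - baseline
    return -advantage * sum(log_probs)


if __name__ == "__main__":
    # Toy topic distributions over three topics (illustrative values only).
    generated = [0.7, 0.2, 0.1]
    visual = [0.6, 0.3, 0.1]
    linguistic = [0.8, 0.1, 0.1]
    reward = topic_reward(generated, visual, linguistic)
    # Log-probabilities of the sampled story tokens (illustrative values only).
    loss = reinforce_loss([-0.5, -1.2, -0.3], reward, baseline=0.5)
    print(f"reward={reward:.4f} loss={loss:.4f}")
```

In this sketch, a generated story whose topic distribution matches both references receives a reward close to 1, so its sampled tokens are reinforced whenever the reward exceeds the baseline. Stories that drift off-topic fall below the baseline and are penalized.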