A proper evaluation of stories generated for a sequence of images -- the task commonly referred to as visual storytelling -- must consider multiple aspects, such as coherence, grammatical correctness, and visual grounding. In this work, we focus on evaluating the degree of grounding, that is, the extent to which a story is about the entities shown in the images. We analyze current metrics, both those designed for this purpose and those designed for general vision-text alignment. Given their observed shortcomings, we propose a novel evaluation tool, GROOViST, that accounts for cross-modal dependencies, temporal misalignments (the fact that the order in which entities appear in the story and in the image sequence may not match), and human intuitions on visual grounding. An additional advantage of GROOViST is its modular design, where the contribution of each component can be assessed and interpreted individually.