We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. Differences on individual measures are often subtle, but they become clearer when the measures are considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives differ systematically from human narratives in how they organise discourse across a visually grounded story. Our code is available at https://github.com/GU-CLASP/coherence-driven-humans.