Album Storytelling with Iterative Story-aware Captioning and Large Language Models

This work studies how to transform an album into vivid and coherent stories, a task we refer to as "album storytelling". While this task can help preserve memories and facilitate experience sharing, it remains underexplored in the current literature. With recent advances in Large Language Models (LLMs), it is now possible to generate lengthy, coherent text, opening up the opportunity to develop an AI assistant for album storytelling. One natural approach is to use caption models to describe each photo in the album, and then use LLMs to summarize and rewrite the generated captions into an engaging story. However, we find that this often results in stories containing hallucinated information that contradicts the images, because each caption is generated without knowledge of the story ("story-agnostic"): it may describe details unrelated to the overall story or miss necessary information. To address these limitations, we propose a new iterative album storytelling pipeline. Specifically, we start with an initial story and build a story-aware caption model that refines the captions using the whole story as guidance. The polished captions are then fed into the LLM to generate a new, refined story. This process is repeated iteratively until the story contains minimal factual errors while maintaining coherence. To evaluate our proposed pipeline, we introduce a new dataset of image collections from vlogs and a set of systematic evaluation metrics. Our results demonstrate that our method generates more accurate and engaging stories for albums, with enhanced coherence and vividness.
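The iterative loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`iterative_album_story`, `caption_fn`, `story_fn`, `count_errors_fn`), the stopping rule (`max_iters`, `max_errors`), and the toy stand-ins for the caption model, the LLM, and the factual-error check are all assumptions introduced here. The abstract does not specify how errors are measured or when iteration stops.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def iterative_album_story(
    images: Sequence[str],
    caption_fn: Callable[[str, Optional[str]], str],
    story_fn: Callable[[List[str]], str],
    count_errors_fn: Callable[[str, Sequence[str]], int],
    max_iters: int = 3,
    max_errors: int = 0,
) -> Tuple[str, int]:
    """Return (story, number_of_refinement_rounds).

    All callables are hypothetical placeholders:
      caption_fn(image, story)  -> caption (story is None on the first pass)
      story_fn(captions)        -> story text (stand-in for an LLM call)
      count_errors_fn(story, images) -> assumed factual-error count
    """
    # Initial pass: story-agnostic captions, no story guidance yet.
    captions = [caption_fn(img, None) for img in images]
    story = story_fn(captions)

    rounds = 0
    # Refine until the assumed error count is low enough or the budget runs out.
    while rounds < max_iters and count_errors_fn(story, images) > max_errors:
        # Story-aware captioning: the current story guides each caption.
        captions = [caption_fn(img, story) for img in images]
        story = story_fn(captions)
        rounds += 1
    return story, rounds


# Toy stand-ins so the sketch runs without any model (purely illustrative).
def toy_caption(image: str, story: Optional[str]) -> str:
    # Without story guidance the caption is vague; with it, it names the scene.
    return "a photo" if story is None else f"a photo of the {image}"


def toy_story(captions: List[str]) -> str:
    return " Then, ".join(captions) + "."


def toy_errors(story: str, images: Sequence[str]) -> int:
    # Counts images whose scene is missing from the story.
    return sum(1 for img in images if img not in story)


if __name__ == "__main__":
    final_story, used_rounds = iterative_album_story(
        ["beach", "market"], toy_caption, toy_story, toy_errors
    )
    print(used_rounds, final_story)
```

Passing the models in as callables keeps the loop independent of any particular captioner or LLM, which is a design choice of this sketch rather than something stated in the abstract.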
