A single image can convey a compelling story through logically connected visual clues, forming Chains-of-Reasoning (CoRs). We define these semantically rich images as Storytelling Images. By conveying multi-layered information that inspires active interpretation, these images enable a wide range of applications, such as illustration and cognitive screening. Despite their potential, such images are scarce and complex to create. To address this, we introduce the Storytelling Image Generation task and propose StorytellingPainter, a two-stage pipeline combining the reasoning of Large Language Models (LLMs) with Text-to-Image (T2I) synthesis. We also develop a dedicated evaluation framework assessing semantic complexity, diversity, and text-image alignment. Furthermore, given the critical role of story generation in the task, we introduce lightweight Mini-Storytellers to bridge the performance gap between small-scale and proprietary LLMs. Experimental results demonstrate the feasibility of our approaches.