Story visualization is the task of transforming narrative elements into image sequences. Existing research has focused primarily on visual contextual coherence, while the deeper narrative essence of a story is often overlooked. This limitation hinders practical application, as generated images frequently fail to fully capture the intended meaning and nuances of the narrative. To address these challenges, we propose VisAgent, a training-free multi-agent framework designed to comprehend and visualize pivotal scenes within a given story. VisAgent employs an agentic workflow that accounts for story distillation, semantic consistency, and contextual coherence. In this workflow, multiple specialized agents collaborate to (i) refine layered prompts based on the narrative structure and (ii) seamlessly integrate generated elements, including refined prompts, scene elements, and subject placement, into the final image. Empirical results confirm the framework's effectiveness and its suitability for practical story visualization applications.