TaleCrafter: Interactive Story Visualization with Multiple Characters

Accurate story visualization requires several elements: identity consistency across frames, alignment between the text and the visual content, and a reasonable layout of objects in each image. Most previous works try to meet these requirements by fitting a text-to-image (T2I) model on a set of videos in the same style and with the same characters, e.g., the FlintstonesSV dataset. However, the learned T2I models typically struggle to adapt to new characters, scenes, and styles, and often lack the flexibility to revise the layout of the synthesized images. This paper proposes a system for generic interactive story visualization, capable of handling multiple novel characters and supporting the editing of layout and local structure. It is developed by leveraging the prior knowledge of large language and T2I models trained on massive corpora. The system comprises four interconnected components: story-to-prompt generation (S2P), text-to-layout generation (T2L), controllable text-to-image generation (C-T2I), and image-to-video animation (I2V). First, the S2P module converts concise story information into the detailed prompts required by subsequent stages. Next, T2L generates diverse and reasonable layouts from the prompts and lets users adjust and refine each layout to their preference. The core component, C-T2I, creates images guided by layouts, sketches, and actor-specific identifiers to maintain consistency and detail across visualizations. Finally, I2V enriches the visualization by animating the generated images. Extensive experiments and a user study validate the effectiveness of the proposed system and the flexibility of its interactive editing.
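
The data flow between the four components can be illustrated with a minimal Python sketch. It is not the authors' implementation: every function, class, and parameter name below is hypothetical, and each stage is a toy stub standing in for the large language model, layout generator, T2I model, and animation model that the abstract describes. Only the order of the stages and the optional user layout edit between T2L and C-T2I are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical convention: normalized (x0, y0, x1, y1) box per character.
Box = Tuple[float, float, float, float]


@dataclass
class Layout:
    boxes: Dict[str, Box]


@dataclass
class Frame:
    prompt: str
    layout: Layout
    image: str                      # placeholder string standing in for pixels
    clip: List[str] = field(default_factory=list)


def story_to_prompts(story: str) -> List[str]:
    """S2P stub: one prompt per sentence (a real system would query an LLM)."""
    return [s.strip() for s in story.split(".") if s.strip()]


def text_to_layout(prompt: str, characters: List[str]) -> Layout:
    """T2L stub: place each mentioned character in an equal-width column."""
    present = [c for c in characters if c in prompt]
    n = len(present)
    boxes = {c: (i / n, 0.25, (i + 1) / n, 0.75) for i, c in enumerate(present)}
    return Layout(boxes)


def controllable_t2i(prompt: str, layout: Layout, sketch: Optional[str] = None) -> str:
    """C-T2I stub: returns a description instead of rendering an image."""
    return f"image({prompt!r}, boxes={sorted(layout.boxes)}, sketch={sketch})"


def image_to_video(image: str, num_frames: int = 3) -> List[str]:
    """I2V stub: tags the still image with a time index per video frame."""
    return [f"{image}@t{t}" for t in range(num_frames)]


def visualize_story(
    story: str,
    characters: List[str],
    edit_layout: Optional[Callable[[str, Layout], Layout]] = None,
) -> List[Frame]:
    """Run S2P -> T2L -> (optional user edit) -> C-T2I -> I2V for each prompt."""
    frames = []
    for prompt in story_to_prompts(story):
        layout = text_to_layout(prompt, characters)
        if edit_layout is not None:
            # Interactive step: the user may revise the proposed layout.
            layout = edit_layout(prompt, layout)
        image = controllable_t2i(prompt, layout)
        frames.append(Frame(prompt, layout, image, image_to_video(image)))
    return frames
```

Calling `visualize_story("Ann meets Bob. Bob waves", ["Ann", "Bob"])` under these assumptions yields two frames, one per sentence. Passing an `edit_layout` callback models the interactive layout refinement: the edit happens before image generation, so a revised layout changes the image without touching the prompt.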
