Story visualization, the task of creating visual narratives from textual descriptions, has seen progress with text-to-image generation models. However, these models often lack effective control over character appearances and interactions, particularly in multi-character scenes. To address these limitations, we propose a new task, \textbf{customized manga generation}, and introduce \textbf{DiffSensei}, a framework specifically designed for generating manga with dynamic multi-character control. DiffSensei integrates a diffusion-based image generator with a multimodal large language model (MLLM) that acts as a text-compatible identity adapter. Our approach employs masked cross-attention to seamlessly incorporate character features, enabling precise layout control without direct pixel transfer. Additionally, the MLLM-based adapter adjusts character features to align with panel-specific text cues, allowing flexible variation in character expressions, poses, and actions. We also introduce \textbf{MangaZero}, a large-scale dataset tailored to this task, containing 43,264 manga pages and 427,147 annotated panels, which supports the visualization of varied character interactions and movements across sequential frames. Extensive experiments demonstrate that DiffSensei outperforms existing models, marking a significant advancement in manga generation by enabling text-adaptable character customization. The project page is \url{https://jianzongwu.github.io/projects/diffsensei/}.
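Since masked cross-attention is the mechanism that ties character identity to panel layout, a minimal sketch may make the idea concrete. The PyTorch snippet below is illustrative only: the class name MaskedCrossAttention, the tensor shapes, and the single-head formulation are our assumptions for exposition, not the released DiffSensei code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedCrossAttention(nn.Module):
    """Sketch of masked cross-attention: each spatial position in the panel
    latent attends only to identity tokens of characters whose layout mask
    covers that position (hypothetical shapes, single head)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, latent, char_tokens, char_masks):
        # latent:      (B, HW, C)   flattened panel latent (queries)
        # char_tokens: (B, N, L, C) L identity tokens per character (keys/values)
        # char_masks:  (B, N, HW)   1 where character n's box covers a position
        B, HW, C = latent.shape
        _, N, L, _ = char_tokens.shape
        q = self.to_q(latent)                                # (B, HW, C)
        kv = char_tokens.reshape(B, N * L, C)
        k, v = self.to_k(kv), self.to_v(kv)                  # (B, N*L, C)
        scores = q @ k.transpose(-2, -1) / C ** 0.5          # (B, HW, N*L)
        # Expand masks to per-token granularity and block attention from a
        # position to every character that does not cover it.
        mask = char_masks.permute(0, 2, 1).unsqueeze(-1)     # (B, HW, N, 1)
        mask = mask.expand(B, HW, N, L).reshape(B, HW, N * L)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        # Rows with no covering character produce NaN after softmax; zero them
        # so positions outside every box receive no identity injection.
        attn = torch.nan_to_num(F.softmax(scores, dim=-1))
        return attn @ v                                      # (B, HW, C)
```

In a full framework, this output would plausibly be added residually to the standard text cross-attention inside each denoising block; the sketch omits multi-head splitting and that residual wiring for brevity.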