Story visualization, the task of creating visual narratives from textual descriptions, has seen progress with text-to-image generation models. However, these models often lack effective control over character appearances and interactions, particularly in multi-character scenes. To address these limitations, we propose a new task, \textbf{customized manga generation}, and introduce \textbf{DiffSensei}, a framework specifically designed for generating manga with dynamic multi-character control. DiffSensei integrates a diffusion-based image generator with a multimodal large language model (MLLM) that acts as a text-compatible identity adapter. Our approach employs masked cross-attention to seamlessly incorporate character features, enabling precise layout control without direct pixel transfer. Additionally, the MLLM-based adapter adjusts character features to align with panel-specific text cues, allowing flexible variation in character expressions, poses, and actions. We also introduce \textbf{MangaZero}, a large-scale dataset tailored to this task, containing 43,264 manga pages and 427,147 annotated panels, which supports the visualization of varied character interactions and movements across sequential frames. Extensive experiments demonstrate that DiffSensei outperforms existing models, marking a significant advancement in manga generation by enabling text-adaptable character customization. The project page is \url{https://jianzongwu.github.io/projects/diffsensei/}.
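Since masked cross-attention is the mechanism that ties character identity to panel layout, a minimal sketch may make the idea concrete. The PyTorch snippet below is illustrative only: the class name MaskedCrossAttention, the tensor shapes, and the single-head formulation are our assumptions for exposition, not the released DiffSensei code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedCrossAttention(nn.Module):
    """Sketch of masked cross-attention: each spatial position in the panel
    latent attends only to identity tokens of characters whose layout mask
    covers that position (hypothetical shapes, single head)."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, latent, char_tokens, char_masks):
        # latent:      (B, HW, C)   flattened panel latent (queries)
        # char_tokens: (B, N, L, C) L identity tokens per character (keys/values)
        # char_masks:  (B, N, HW)   1 where character n's box covers a position
        B, HW, C = latent.shape
        _, N, L, _ = char_tokens.shape
        q = self.to_q(latent)                                # (B, HW, C)
        kv = char_tokens.reshape(B, N * L, C)
        k, v = self.to_k(kv), self.to_v(kv)                  # (B, N*L, C)
        scores = q @ k.transpose(-2, -1) / C ** 0.5          # (B, HW, N*L)
        # Expand masks to per-token granularity and block attention from a
        # position to every character that does not cover it.
        mask = char_masks.permute(0, 2, 1).unsqueeze(-1)     # (B, HW, N, 1)
        mask = mask.expand(B, HW, N, L).reshape(B, HW, N * L)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        # Rows with no covering character produce NaN after softmax; zero them
        # so positions outside every box receive no identity injection.
        attn = torch.nan_to_num(F.softmax(scores, dim=-1))
        return attn @ v                                      # (B, HW, C)
```

In a full framework, this output would plausibly be added residually to the standard text cross-attention inside each denoising block; the sketch omits multi-head splitting and that residual wiring for brevity.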