This paper introduces MultiBooth, a novel and efficient technique for multi-concept customization in text-to-image generation. Despite significant advances in customized generation methods, particularly with the success of diffusion models, existing methods often struggle in multi-concept scenarios because of low concept fidelity and high inference cost. MultiBooth addresses these issues by dividing the multi-concept generation process into two phases: a single-concept learning phase and a multi-concept integration phase. During the single-concept learning phase, we employ a multi-modal image encoder and an efficient concept encoding technique to learn a concise and discriminative representation for each concept. In the multi-concept integration phase, we use bounding boxes to define the generation area for each concept within the cross-attention map. Each concept is thus generated within its specified region, and the regions together form the multi-concept image. This strategy not only improves concept fidelity but also reduces additional inference cost. MultiBooth surpasses various baselines in both qualitative and quantitative evaluations, demonstrating superior performance and computational efficiency. Project Page: https://multibooth.github.io/
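To make the multi-concept integration phase more concrete, the following is a minimal NumPy sketch of cross-attention restricted by bounding boxes. It is an illustration under stated assumptions, not the MultiBooth implementation: the function names (`box_mask`, `cross_attention`, `regional_cross_attention`), the normalized `(x0, y0, x1, y1)` box format, and the choice to overwrite the base-prompt output inside each box with the output computed from that concept's own keys and values are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def box_mask(height, width, box):
    # Boolean (height, width) mask, True inside box = (x0, y0, x1, y1).
    # Assumption: box coordinates are normalized to [0, 1].
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=bool)
    mask[int(y0 * height):int(y1 * height), int(x0 * width):int(x1 * width)] = True
    return mask


def cross_attention(queries, keys, values):
    # Standard scaled dot-product attention from image queries to text tokens.
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values


def regional_cross_attention(queries, base_kv, concept_kvs, boxes, height, width):
    # queries:     (height * width, d) flattened image features.
    # base_kv:     (keys, values) for the full prompt.
    # concept_kvs: one (keys, values) pair per concept (hypothetical layout).
    # boxes:       one bounding box per concept.
    out = cross_attention(queries, *base_kv)
    for (keys, values), box in zip(concept_kvs, boxes):
        region = box_mask(height, width, box).reshape(-1)
        # Inside the box, attend only to this concept's tokens and
        # write the result back into the full feature map.
        out[region] = cross_attention(queries[region], keys, values)
    return out
```

In this sketch, positions outside every box keep the base-prompt attention output, while each box is handled with one extra attention call over only the positions it covers. If boxes overlap, later concepts overwrite earlier ones; how overlaps should actually be resolved is left open here.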