Text-to-image diffusion models have shown remarkable success in generating a personalized subject from a few reference images. However, current methods struggle to handle multiple subjects simultaneously, often producing mixed identities that combine attributes of different subjects. In this work, we present MuDI, a novel framework that enables multi-subject personalization by effectively decoupling the identities of multiple subjects. Our main idea is to use segmented subjects generated by the Segment Anything Model (SAM) in both training and inference: as a form of data augmentation during training and as an initialization for the generation process. Our experiments demonstrate that MuDI produces high-quality personalized images without identity mixing, even for highly similar subjects, as shown in Figure 1. In human evaluation, MuDI achieves twice as many successes as existing baselines in personalizing multiple subjects without identity mixing, and it is preferred over the strongest baseline in more than 70% of comparisons. More results are available at https://mudi-t2i.github.io/.
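The core idea of reusing segmented subjects can be illustrated with a minimal compositing sketch. It assumes each subject is already available as an RGB array plus a boolean mask (for example, a mask produced by SAM). The function name `compose_segmented_subjects` and the uniform random placement on a white canvas are hypothetical choices made for illustration; this is not MuDI's actual implementation, and details such as rescaling, overlap control, and prompt construction are not covered.

```python
import numpy as np


def compose_segmented_subjects(cutouts, canvas_size=512, rng=None):
    """Hypothetical sketch: paste segmented subjects onto a blank canvas.

    cutouts: list of (rgb, mask) pairs, where rgb is an HxWx3 float array in
    [0, 1] and mask is an HxW boolean array (e.g. from a segmentation model).
    Returns the composed image and a per-pixel subject-index map
    (-1 marks background). Later subjects overwrite earlier ones where they
    overlap; this placement policy is an assumption, not the paper's.
    """
    rng = np.random.default_rng() if rng is None else rng
    # White background canvas and an "empty" label map.
    canvas = np.ones((canvas_size, canvas_size, 3), dtype=np.float32)
    labels = np.full((canvas_size, canvas_size), -1, dtype=np.int64)

    for idx, (rgb, mask) in enumerate(cutouts):
        h, w = mask.shape
        if h > canvas_size or w > canvas_size:
            raise ValueError("cutout is larger than the canvas")
        # Uniformly random top-left corner so the cutout fits entirely.
        top = int(rng.integers(0, canvas_size - h + 1))
        left = int(rng.integers(0, canvas_size - w + 1))
        # Slices are views, so boolean assignment writes into canvas/labels.
        region = canvas[top:top + h, left:left + w]
        region[mask] = rgb[mask]
        labels[top:top + h, left:left + w][mask] = idx

    return canvas, labels
```

Keeping a label map alongside the image makes it easy to check which pixels belong to which subject, which is the kind of separation that decoupling identities relies on.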