Personalized text-to-image generation aims to synthesize images of user-provided concepts in diverse contexts. Despite recent progress in multi-concept personalization, most existing methods are limited to object concepts and struggle to customize abstract concepts (e.g., pose, lighting). Some methods have begun to explore multi-concept personalization that supports abstract concepts, but they require test-time fine-tuning for each new concept, which is time-consuming and prone to overfitting on limited training images. In this work, we propose a novel tuning-free method for multi-concept personalization that effectively customizes both object and abstract concepts without test-time fine-tuning. Our method builds upon the modulation mechanism in pre-trained Diffusion Transformer (DiT) models, leveraging the localized and semantically meaningful properties of the modulation space. Specifically, we propose a novel module, Mod-Adapter, that predicts concept-specific modulation directions for the modulation of concept-related text tokens. It introduces vision-language cross-attention to extract concept visual features, and Mixture-of-Experts (MoE) layers that adaptively map these features into the modulation space. Furthermore, to mitigate the training difficulty caused by the large gap between the concept image space and the modulation space, we introduce a VLM-guided pre-training strategy that leverages the strong image understanding capabilities of vision-language models (VLMs) to provide semantic supervision signals. For a comprehensive comparison, we extend a standard benchmark by incorporating abstract concepts. Our method achieves state-of-the-art performance in multi-concept personalization, as supported by quantitative, qualitative, and human evaluations.
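To make the idea of mapping concept features into the modulation space more concrete, the following is a minimal, illustrative sketch and not the paper's implementation. It assumes a soft-gated mixture of linear experts and assumes that the predicted direction is added to a token's modulation vector; both are assumptions made here for illustration. The names `ToyMoEMapper` and `shift_modulation`, the dimensions, and the random weights are hypothetical, and the vision-language cross-attention step is omitted (the concept feature is simply given as a vector).

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyMoEMapper:
    """Hypothetical soft-gated mixture of linear experts.

    Maps a concept feature vector to a direction in a toy modulation space.
    This is an assumed design for illustration, not the paper's MoE layer.
    """

    def __init__(self, feat_dim, mod_dim, num_experts, seed=0):
        rng = random.Random(seed)
        self.gate = random_matrix(num_experts, feat_dim, rng)
        self.experts = [
            random_matrix(mod_dim, feat_dim, rng) for _ in range(num_experts)
        ]

    def gate_weights(self, feature):
        # One non-negative weight per expert, summing to 1.
        return softmax(matvec(self.gate, feature))

    def __call__(self, feature):
        weights = self.gate_weights(feature)
        outputs = [matvec(expert, feature) for expert in self.experts]
        mod_dim = len(outputs[0])
        # Gate-weighted sum of the expert outputs.
        return [
            sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(mod_dim)
        ]


def shift_modulation(base_modulation, direction, scale=1.0):
    """Assumption: the direction is applied as an additive shift."""
    return [b + scale * d for b, d in zip(base_modulation, direction)]


mapper = ToyMoEMapper(feat_dim=4, mod_dim=3, num_experts=2)
concept_feature = [0.5, -1.0, 0.25, 2.0]  # stand-in for extracted visual features
direction = mapper(concept_feature)
token_modulation = shift_modulation([1.0, 1.0, 1.0], direction)
```

In this sketch only the modulation vector of a concept-related token would be shifted, which reflects the localized property of the modulation space described above; how the actual method applies the direction inside the DiT is not specified in the abstract.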