We introduce Mixture-of-Attention (MoA), a new architecture for personalizing text-to-image diffusion models. Inspired by the Mixture-of-Experts mechanism used in large language models (LLMs), MoA distributes the generation workload between two attention pathways: a personalized branch and a non-personalized prior branch. MoA retains the original model's prior by keeping the attention layers in the prior branch fixed, while the personalized branch intervenes minimally in the generation process, learning to embed subjects in the layout and context generated by the prior branch. A novel routing mechanism manages the distribution of pixels in each layer across these branches to optimize the blend of personalized and generic content creation. Once trained, MoA facilitates the creation of high-quality, personalized images featuring multiple subjects, with compositions and interactions as diverse as those generated by the original model. Crucially, MoA sharpens the distinction between the model's pre-existing capability and the newly added personalized intervention, thereby offering more disentangled subject-context control than was previously attainable. Project page: https://snap-research.github.io/mixture-of-attention
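To make the two-branch idea concrete, the following is a minimal sketch of per-pixel routing between a prior attention branch and a personalized attention branch. It is not the MoA implementation: the soft (softmax) blend, the linear router, the single-head attention, and every name here (`mixture_of_attention`, `router_weights`, `prior_kv`, `personal_kv`) are assumptions introduced for illustration, and training, including how the prior branch is kept fixed, is omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Plain single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

def mixture_of_attention(pixels, prior_kv, personal_kv, router_weights):
    """Blend the outputs of two attention branches, pixel by pixel.

    prior_kv / personal_kv: (keys, values) pairs for each branch.
    router_weights: two vectors, one per branch (a hypothetical linear
    router; the routing mechanism in the paper may differ).
    """
    prior_out = attention(pixels, *prior_kv)
    personal_out = attention(pixels, *personal_kv)
    blended, routing = [], []
    for x, p, s in zip(pixels, prior_out, personal_out):
        # One logit per branch, normalized so the two weights sum to 1.
        w_prior, w_personal = softmax([dot(x, r) for r in router_weights])
        routing.append((w_prior, w_personal))
        blended.append([w_prior * a + w_personal * b for a, b in zip(p, s)])
    return blended, routing

# Toy example: two "pixels" with 2-dimensional features.
pixels = [[1.0, 0.0], [0.0, 1.0]]
prior_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 2.0]])
personal_kv = ([[1.0, 0.0]], [[5.0, 5.0]])
router_weights = [[4.0, 0.0], [0.0, 4.0]]  # pixel 0 -> prior, pixel 1 -> personalized

out, routing = mixture_of_attention(pixels, prior_kv, personal_kv, router_weights)
```

In this toy setup the first pixel is routed almost entirely to the prior branch and the second almost entirely to the personalized branch, so only the second pixel's output is dominated by the personalized values. A soft blend is used here because it keeps the sketch differentiable in principle; whether routing is soft or hard in the actual model is not stated in the text above.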