Multi-subject image generation aims to synthesize images that faithfully preserve the identities of multiple reference subjects while following textual instructions. However, existing methods often suffer from identity inconsistency and limited compositional control, as they rely on diffusion models to implicitly associate text prompts with reference images. In this work, we propose Hierarchical Concept-to-Appearance Guidance (CAG), a framework that provides explicit, structured supervision from high-level concepts down to fine-grained appearances. At the conceptual level, we introduce a VAE dropout training strategy that randomly omits reference VAE features, encouraging the model to rely on robust semantic signals from a Vision-Language Model (VLM) and thereby promoting consistent concept-level generation even when complete appearance cues are absent. At the appearance level, we integrate the VLM-derived correspondences into a correspondence-aware masked attention module within the Diffusion Transformer (DiT). This module restricts attention so that each text token attends only to its matched reference regions, ensuring precise attribute binding and reliable multi-subject composition. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multi-subject image generation, substantially improving prompt following and subject consistency.
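To make the conceptual-level strategy concrete, below is a minimal sketch of VAE dropout, assuming a PyTorch training setup in which each reference image contributes a sequence of VAE appearance tokens alongside separate VLM semantic tokens. The function name `vae_dropout`, the tensor shapes, the `drop_prob` value, and the choice to zero dropped features (rather than remove their tokens from the sequence) are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def vae_dropout(vae_feats: torch.Tensor, drop_prob: float = 0.3,
                training: bool = True) -> torch.Tensor:
    """Randomly drop the VAE appearance features of whole reference subjects.

    vae_feats: (B, N_ref, L, D) fine-grained appearance tokens, one sequence
               of length L per reference image.
    Returns the same tensor with randomly selected references zeroed out,
    so the model must fall back on VLM semantic tokens for those subjects.
    """
    if not training or drop_prob == 0.0:
        return vae_feats
    B, N_ref = vae_feats.shape[:2]
    # One Bernoulli draw per (sample, reference): True = keep, False = drop.
    keep = torch.rand(B, N_ref, device=vae_feats.device) >= drop_prob
    return vae_feats * keep.to(vae_feats.dtype).view(B, N_ref, 1, 1)
```

Dropping per (sample, reference) rather than per token removes an entire subject's appearance cues at once, which is what pushes the model toward the VLM's concept-level signal for that subject.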
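Likewise, a minimal sketch of the appearance-level mechanism, correspondence-aware masked attention, is shown below. The tensor layout and the boolean `corr_mask` (which would be derived from the VLM's text-token-to-reference-region correspondences) are hypothetical; the sketch only illustrates how the mask restricts each text token to its matched reference regions inside a DiT attention layer.

```python
import torch

def correspondence_masked_attention(q_text: torch.Tensor, k_ref: torch.Tensor,
                                    v_ref: torch.Tensor,
                                    corr_mask: torch.Tensor) -> torch.Tensor:
    """Attention from text tokens to reference-image tokens, masked so each
    text token attends only to reference regions it corresponds to.

    q_text:    (B, H, T, d)  queries from text tokens
    k_ref:     (B, H, R, d)  keys from reference-image tokens
    v_ref:     (B, H, R, d)  values from reference-image tokens
    corr_mask: (B, T, R) bool, True where text token t matches ref region r
    """
    scale = q_text.shape[-1] ** -0.5
    logits = torch.einsum("bhtd,bhrd->bhtr", q_text, k_ref) * scale
    # Disallowed (text token, reference region) pairs get -inf before softmax,
    # so their attention weight is exactly zero.
    logits = logits.masked_fill(~corr_mask[:, None], float("-inf"))
    attn = logits.softmax(dim=-1)
    # Text tokens with no matched region yield all -inf rows (NaN after
    # softmax); zero them so they simply receive no reference information.
    attn = torch.nan_to_num(attn, nan=0.0)
    return torch.einsum("bhtr,bhrd->bhtd", attn, v_ref)
```

Masking the logits rather than the output is the standard way to realize hard attention constraints: it renormalizes each text token's attention over only its permitted reference regions, which is what enforces the attribute binding described above.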