Existing multimodal conditional image synthesis (MCIS) methods generate images conditioned on any combination of modalities, but they require all of the modalities to conform exactly to one another, which hinders synthesis controllability and leaves cross-modal potential under-exploited. To this end, we propose to generate images conditioned on compositions of multimodal control signals in which the modalities are imperfectly complementary, a task we call composed multimodal conditional image synthesis (CMCIS). We identify two challenges in the proposed CMCIS task: the modality coordination problem and the modality imbalance problem. To tackle them, we introduce a Mixture-of-Modality-Tokens Transformer (MMoT) that adaptively fuses fine-grained multimodal control signals, a multimodal balanced training loss that stabilizes the optimization of each modality, and a multimodal sampling guidance that balances the strength of each modality's control signal. Comprehensive experimental results demonstrate that MMoT achieves superior performance on both unimodal conditional image synthesis (UCIS) and MCIS tasks, producing high-quality images that are faithful to complex multimodal conditions. The project website is available at https://jabir-zheng.github.io/MMoT.