Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous works that focus on leveraging the reasoning abilities and world knowledge of understanding models, this paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In the generation process of our framework, an MLLM first reasons over the entire conversational context, including the text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, via a novel Bridge Adapter. This feature acts as a crucial link, forging insights from the understanding model to quench and refine the generation process. It is subsequently injected into the T2I backbone as a visual guidance signal, alongside the enhanced text instruction, which replaces the original input. To validate this paradigm, we conduct comprehensive studies on the design of the Bridge Feature and Bridge Adapter. Our framework demonstrates exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant savings in training overhead, all without compromising the MLLM's inherent multimodal understanding capabilities. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models, while also maintaining instruction-following accuracy and enhancing the application of world knowledge. Models and code are available at https://github.com/YanbingZeng/Forge-and-Quench.
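The instruction-to-Bridge-Feature mapping described above can be sketched as a small cross-attention module: a set of learnable query tokens attends over the MLLM's hidden states for the enhanced instruction and is projected into the T2I backbone's visual-feature space. This is a minimal, hypothetical PyTorch sketch for illustration only; the dimensions, token count, and module names are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class BridgeAdapter(nn.Module):
    """Hypothetical sketch of a Bridge Adapter: maps MLLM text hidden
    states to a fixed number of 'Bridge Feature' tokens that can be
    injected into a T2I backbone as visual guidance."""

    def __init__(self, mllm_dim=4096, t2i_dim=1024, num_bridge_tokens=64, num_heads=8):
        super().__init__()
        # Learnable queries that will pool information from the
        # enhanced instruction's hidden states.
        self.queries = nn.Parameter(torch.randn(num_bridge_tokens, mllm_dim))
        self.attn = nn.MultiheadAttention(mllm_dim, num_heads=num_heads, batch_first=True)
        # Project pooled tokens into the T2I backbone's feature space.
        self.proj = nn.Linear(mllm_dim, t2i_dim)

    def forward(self, text_hidden):
        # text_hidden: (batch, seq_len, mllm_dim) — MLLM hidden states
        # for the enhanced text instruction.
        b = text_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        bridge, _ = self.attn(q, text_hidden, text_hidden)
        # Returns (batch, num_bridge_tokens, t2i_dim) Bridge Features.
        return self.proj(bridge)


# Toy usage with small dimensions:
adapter = BridgeAdapter(mllm_dim=32, t2i_dim=16, num_bridge_tokens=4, num_heads=4)
hidden = torch.randn(2, 5, 32)  # batch of 2 instructions, 5 tokens each
bridge_feats = adapter(hidden)  # shape: (2, 4, 16)
```

The query-pooling design mirrors common resampler-style adapters: it decouples the Bridge Feature's token count from the instruction length, which is one way a fixed-size visual guidance signal could be produced for the T2I backbone.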