Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models' ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use only the last-layer representations of a VLM, employ multiple visual encoders, or jointly train large unified models for both text and image generation, which demands substantial computational resources and large-scale data, limiting accessibility. We present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism, which extracts both high-level semantics and low-level details from the text and visual tokens of the frozen VLM to condition the diffusion model. We demonstrate that LAP outperforms other shallow-fusion architectures both on text-image alignment for generation and on faithful transfer of visual information from the VLM to the diffusion model, which is key for editing. We further propose VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI), which conditions the diffusion transformer (DiT) only on the text tokens generated by the VLM during in-model prompt rewriting. VERIFI combines alignment of the conditioning distribution with the VLM's reasoning capabilities, increasing capability and flexibility at inference. In addition, finetuning on the editing task not only improves text-image alignment for generation, indicative of cross-modal knowledge transfer, but also generalizes strongly: trained only on single-image editing, our model zero-shot generalizes to multiple image references, further motivating the unified-encoder design of UniFusion.
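The layerwise pooling idea behind LAP can be sketched as follows. This is a minimal, hypothetical NumPy illustration only: the learned query, the key projection, and the softmax-over-layers normalization are assumptions for illustration, not the paper's actual formulation. The core point is that each conditioning token is a learned convex combination of that token's hidden states across *all* VLM layers, rather than the last layer alone:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layerwise_attention_pool(hidden_states, query, w_k):
    """Pool per-token hidden states across VLM layers via attention.

    hidden_states: (L, T, d) — token states from all L layers of a frozen VLM
    query:         (d,)      — learned pooling query (assumed parameter)
    w_k:           (d, d)    — learned key projection (assumed parameter)
    returns:       (T, d)    — one pooled conditioning vector per token
    """
    L, T, d = hidden_states.shape
    keys = hidden_states @ w_k                 # (L, T, d)
    scores = keys @ query / np.sqrt(d)         # (L, T) attention logits
    attn = softmax(scores, axis=0)             # normalize over the layer axis
    # weighted sum over layers: early layers can contribute low-level detail,
    # late layers high-level semantics
    return (attn[..., None] * hidden_states).sum(axis=0)
```

In this sketch the pooled sequence keeps the VLM's token length, so it can be fed to the DiT wherever last-layer features would otherwise go; the attention weights decide, per token, how much each layer contributes.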