We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by combining the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models, which predominantly depend on encoders such as CLIP or ImageBind and need large amounts of training data to bridge the gap between modalities, EasyGen is built upon a bidirectional conditional diffusion model named BiDiffuser, which promotes more efficient interactions between modalities. EasyGen handles image-to-text generation by integrating BiDiffuser and an LLM via a simple projection layer. Whereas most existing multimodal models are limited to generating text responses, EasyGen also supports text-to-image generation: the LLM creates textual descriptions, which BiDiffuser interprets to generate appropriate visual responses. Extensive quantitative and qualitative experiments demonstrate the effectiveness of EasyGen, which can be trained easily in a lab setting. The source code is available at https://github.com/zxy556677/EasyGen.
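To make the "simple projection layer" idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the projection is a single linear map that carries image-side feature vectors into the LLM's embedding space, and that the projected vectors are prepended to the text token embeddings. The function names (`project`, `build_llm_input`), the dimensions, and the random inputs are all hypothetical and chosen only for illustration; the actual design lives in the linked repository.

```python
import random

def project(features, weight, bias):
    """Apply a linear map to each feature vector: out = x @ W + b."""
    return [
        [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
         for j in range(len(bias))]
        for x in features
    ]

def build_llm_input(image_feats, text_embeds, weight, bias):
    """Prepend projected image vectors to the text token embeddings."""
    return project(image_feats, weight, bias) + text_embeds

# Illustrative sizes only (assumptions, not values from the paper).
random.seed(0)
d_img, d_llm = 8, 16
n_img_tokens, n_text_tokens = 4, 5

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

image_feats = rand_matrix(n_img_tokens, d_img)    # stand-in for image-side features
text_embeds = rand_matrix(n_text_tokens, d_llm)   # stand-in for LLM token embeddings
weight = rand_matrix(d_img, d_llm)                # projection weights (learned in practice)
bias = [0.0] * d_llm

llm_input = build_llm_input(image_feats, text_embeds, weight, bias)
print(len(llm_input), len(llm_input[0]))
```

In this sketch only `weight` and `bias` would be learned, which is one plausible reading of why a projection-based bridge can keep training cost low; whether other components are also tuned is not stated in the abstract.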