We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models, which predominantly depend on encoders such as CLIP or ImageBind and need large amounts of training data to bridge modalities, EasyGen leverages BiDiffuser, a bidirectional conditional diffusion model, to foster more efficient modality interactions. EasyGen achieves text generation by training a projection layer that links BiDiffuser and an LLM, and facilitates image generation by training an adapter that aligns the LLM's text space with BiDiffuser's image space. Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extensibility, effectively addressing the challenges in multimodal generation. The source code is available at https://github.com/zxy556677/EasyGen.