We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominantly depend on encoders such as CLIP or ImageBind and require large amounts of training data to bridge modalities, EasyGen leverages BiDiffuser, a bidirectional conditional diffusion model, to foster more efficient modality interactions. EasyGen achieves text generation by training a projection layer that links BiDiffuser to an LLM, and facilitates image generation by training an adapter that aligns the LLM's text space with BiDiffuser's image space. Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extensibility, effectively addressing the challenges of multimodal generation. The source code is available at https://github.com/zxy556677/EasyGen.
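The abstract names two lightweight bridges: a projection layer carrying BiDiffuser's image features into the LLM for text generation, and an adapter mapping the LLM's text representations back into BiDiffuser's conditioning space for image generation. The following is a minimal sketch of what such bridges could look like, assuming simple linear/MLP mappings; all module names, class names, and dimensions are illustrative assumptions, not taken from the EasyGen implementation:

```python
# Hypothetical sketch of the two bridging modules described in the abstract.
# Dimensions (diff_dim, llm_dim, cond_dim) and module structure are assumed
# for illustration; they are not the authors' actual architecture.
import torch
import torch.nn as nn

class ImageToLLMProjection(nn.Module):
    """Projects diffusion-model image features into the LLM's embedding
    space, so the LLM can condition text generation on them."""
    def __init__(self, diff_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(diff_dim, llm_dim)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_tokens, diff_dim) -> (batch, num_tokens, llm_dim)
        return self.proj(image_feats)

class LLMToImageAdapter(nn.Module):
    """Maps LLM hidden states into the diffusion model's conditioning
    space, so the diffusion model can generate images from them."""
    def __init__(self, llm_dim: int = 4096, cond_dim: int = 768):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(llm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (batch, seq_len, llm_dim) -> (batch, seq_len, cond_dim)
        return self.adapter(llm_hidden)
```

In this reading, only these small modules are trained while the diffusion model and LLM stay (largely) frozen, which is consistent with the data-efficiency claim in the abstract.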