With the recent advances in large language models (LLMs), there is growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) focus mainly on multimodal understanding. This survey elaborates on multimodal generation and editing across various domains, spanning image, video, 3D, and audio. Specifically, we summarize notable advancements and milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods. We then summarize the various roles of LLMs in multimodal generation and exhaustively investigate the critical technical components behind these methods as well as the multimodal datasets used in these studies. Additionally, we examine tool-augmented multimodal agents that leverage existing generative models for human-computer interaction. Lastly, we discuss advancements in generative AI safety, investigate emerging applications, and outline future prospects. Our work provides a systematic and insightful overview of multimodal generation and processing, which we expect to advance the development of AI-generated content (AIGC) and world models. A curated list of all related papers can be found at https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation