With recent advances in large language models (LLMs), there is growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on understanding. This survey instead covers multimodal generation across different domains, including image, video, 3D, and audio, and highlights notable advancements and milestone works in each field. Specifically, we investigate in depth the key technical components behind the methods and the multimodal datasets used in these studies. Moreover, we examine tool-augmented multimodal agents that can use existing generative models for human-computer interaction. Lastly, we discuss in detail the advances in AI safety and investigate emerging applications as well as future prospects. Our work provides a systematic and insightful overview of multimodal generation, which we expect to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation