Multimodal Generative Models (MGMs) have rapidly evolved beyond text generation, now spanning diverse output modalities, including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorises six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyse key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Building on a common taxonomy of models and training recipes, we propose a unified evaluation framework centred on faithfulness, compositionality, and robustness, and synthesise evidence from benchmarks and human studies across modalities. We further analyse trustworthiness, safety, and ethical risks, including multimodal bias, privacy leakage, and the misuse of high-fidelity media generation for deepfakes, disinformation, and copyright infringement in music and 3D assets, together with emerging mitigation strategies. Finally, we discuss how architectural trends, evaluation protocols, and governance mechanisms can be co-designed to close current capability and safety gaps, outlining critical paths toward more general-purpose, controllable, and accountable multimodal generative systems.