As Large Language Models (LLMs) have grown popular, an important trend has emerged of using multimodality to augment their generation ability, enabling LLMs to interact better with the world. However, there is no unified view of at which stage, and how, different modalities should be incorporated. In this survey, we review methods that assist and augment generative models by retrieving multimodal knowledge in formats including images, code, tables, graphs, and audio. Such methods offer a promising solution to important concerns such as factuality, reasoning, interpretability, and robustness. By providing an in-depth review, this survey aims to give scholars a deeper understanding of these methods' applications and to encourage them to adapt existing techniques to the fast-growing field of LLMs.