The exploration of multimodal language models integrates multiple data types, such as images, text, language, audio, and other heterogeneity. While the latest large language models excel in text-based tasks, they often struggle to understand and process other data types. Multimodal models address this limitation by combining various modalities, enabling a more comprehensive understanding of diverse data. This paper begins by defining the concept of multimodal and examining the historical development of multimodal algorithms. Furthermore, we introduce a range of multimodal products, focusing on the efforts of major technology companies. A practical guide is provided, offering insights into the technical aspects of multimodal models. Moreover, we present a compilation of the latest algorithms and commonly used datasets, providing researchers with valuable resources for experimentation and evaluation. Lastly, we explore the applications of multimodal models and discuss the challenges associated with their development. By addressing these aspects, this paper aims to facilitate a deeper understanding of multimodal models and their potential in various domains.
翻译:多模态语言模型的探索整合了多种数据类型,如图像、文本、语言、音频及其他异质性信息。尽管最新的大语言模型在基于文本的任务中表现出色,但它们往往难以理解和处理其他数据类型。多模态模型通过结合多种模态弥补了这一局限,从而能够更全面地理解多样化数据。本文首先定义了多模态的概念,并梳理了多模态算法的历史发展。此外,我们介绍了一系列多模态产品,重点关注主要科技公司的相关努力。本文提供了一份实用指南,深入阐释了多模态模型的技术层面。同时,我们汇编了最新的算法及常用数据集,为研究人员的实验与评估提供了宝贵资源。最后,我们探讨了多模态模型的应用及其开发过程中面临的挑战。通过上述讨论,本文旨在促进对多模态模型及其在各领域潜力的更深入理解。