Large Language Models (LLMs) represent a class of deep learning models adept at understanding natural language and generating coherent responses to various prompts or queries. These models far exceed the complexity of conventional neural networks, often encompassing dozens of neural network layers and containing billions to trillions of parameters. They are typically trained on vast datasets, utilizing architectures based on transformer blocks. Present-day LLMs are multi-functional, capable of performing a range of tasks from text generation and language translation to question answering, as well as code generation and analysis. An advanced subset of these models, known as Multimodal Large Language Models (MLLMs), extends LLM capabilities to process and interpret multiple data modalities, including images, audio, and video. This enhancement empowers MLLMs with capabilities like video editing, image comprehension, and captioning for visual content. This survey provides a comprehensive overview of the recent advancements in LLMs. We begin by tracing the evolution of LLMs and subsequently delve into the advent and nuances of MLLMs. We analyze emerging state-of-the-art MLLMs, exploring their technical features, strengths, and limitations. Additionally, we present a comparative analysis of these models and discuss their challenges, potential limitations, and prospects for future development.