Large language models (LLMs) have achieved superior performance in powering text-based AI agents, endowing them with decision-making and reasoning abilities akin to those of humans. Concurrently, an emerging research trend focuses on extending these LLM-powered AI agents into the multimodal domain. This extension enables AI agents to interpret and respond to diverse multimodal user queries, thereby handling more intricate and nuanced tasks. In this paper, we conduct a systematic review of LLM-driven multimodal agents, which we refer to as large multimodal agents (LMAs for short). First, we introduce the essential components involved in developing LMAs and categorize the current body of research into four distinct types. Subsequently, we review the collaborative frameworks that integrate multiple LMAs to enhance collective efficacy. One of the critical challenges in this field is the diversity of evaluation methods used across existing studies, which hinders effective comparison among different LMAs. Therefore, we compile these evaluation methodologies and establish a comprehensive framework to bridge the gaps. This framework aims to standardize evaluations, facilitating more meaningful comparisons. To conclude our review, we highlight the extensive applications of LMAs and propose possible future research directions. Our discussion aims to provide valuable insights and guidelines for future research in this rapidly evolving field. An up-to-date resource list is available at https://github.com/jun0wanan/awesome-large-multimodal-agents.