Strong Artificial Intelligence (Strong AI), or Artificial General Intelligence (AGI), with abstract reasoning ability is the goal of next-generation AI. Recent advances in Large Language Models (LLMs), along with the emerging field of Multimodal Large Language Models (MLLMs), have demonstrated impressive capabilities across a wide range of multimodal tasks and applications. In particular, various MLLMs, each with a distinct model architecture, training data, and training stages, have been evaluated on a broad range of MLLM benchmarks. These studies have, to varying degrees, revealed different aspects of the current capabilities of MLLMs. However, the reasoning abilities of MLLMs have not been systematically investigated. In this survey, we comprehensively review the existing evaluation protocols for multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applying MLLMs to reasoning-intensive tasks, and finally discuss current practices and future directions. We believe our survey establishes a solid base and sheds light on the important topic of multimodal reasoning.