The surge of interest in Multi-modal Large Language Models (MLLMs), e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both academia and industry. These models endow Large Language Models (LLMs) with powerful visual understanding capabilities, enabling them to tackle diverse multi-modal tasks. Very recently, Google released Gemini, its newest and most capable MLLM, built from the ground up for multi-modality. Given its strong reasoning capabilities, can Gemini challenge GPT-4V's leading position in multi-modal learning? In this paper, we present a preliminary exploration of Gemini Pro's visual understanding proficiency, covering four domains: fundamental perception, advanced cognition, challenging vision tasks, and various expert capacities. We compare Gemini Pro with the state-of-the-art GPT-4V to evaluate its upper limits, and with the latest open-source MLLM, Sphinx, to reveal the gap between manual efforts and black-box systems. The qualitative samples indicate that, although GPT-4V and Gemini show different answering styles and preferences, they exhibit comparable visual reasoning capabilities, while Sphinx still trails behind them in domain generalizability. Specifically, GPT-4V tends to give detailed explanations and intermediate steps, whereas Gemini prefers to output a direct and concise answer. The quantitative evaluation on the popular MME benchmark also demonstrates Gemini's potential to be a strong challenger to GPT-4V. Our early investigation of Gemini also reveals some issues common to MLLMs, indicating that a considerable distance remains on the path towards artificial general intelligence. Our project for tracking the progress of MLLMs is released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.