A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise

Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Hongsheng Li, Xing Sun

From arXiv. 120 pages in total. See our project at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models

The surge of interest in Multi-modal Large Language Models (MLLMs), e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both academia and industry. They endow Large Language Models (LLMs) with powerful capabilities in visual understanding, enabling them to tackle diverse multi-modal tasks. Very recently, Google released Gemini, its newest and most capable MLLM, built from the ground up for multi-modality. In light of its superior reasoning capabilities, can Gemini challenge GPT-4V's leading position in multi-modal learning? In this paper, we present a preliminary exploration of Gemini Pro's visual understanding proficiency, which comprehensively covers four domains: fundamental perception, advanced cognition, challenging vision tasks, and various expert capacities. We compare Gemini Pro with the state-of-the-art GPT-4V to evaluate its upper limits, along with the latest open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and black-box systems. The qualitative samples indicate that, while GPT-4V and Gemini showcase different answering styles and preferences, they exhibit comparable visual reasoning capabilities, and Sphinx still trails behind them in domain generalizability. Specifically, GPT-4V tends to elaborate detailed explanations and intermediate steps, whereas Gemini prefers to output a direct and concise answer. The quantitative evaluation on the popular MME benchmark also demonstrates the potential of Gemini to be a strong challenger to GPT-4V. Our early investigation of Gemini also reveals some issues common to MLLMs, indicating that a considerable distance remains on the path towards artificial general intelligence. Our project for tracking the progress of MLLMs is released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.



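On December 6, 2023, Google CEO Sundar Pichai announced the official launch of Gemini 1.0. The Gemini model released at that time is natively multi-modal and marks the first step in a new era of Google's large models. It comes in three sizes: Gemini Ultra, the most capable; Gemini Pro, suited to a wide range of tasks; and Gemini Nano, intended for specific tasks and on-device use.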
