Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models

The burgeoning interest in Multimodal Large Language Models (MLLMs), such as OpenAI's GPT-4V(ision), has significantly impacted both academia and industry. These models extend Large Language Models (LLMs) with advanced visual understanding capabilities, enabling their application to a variety of multimodal tasks. Recently, Google introduced Gemini, a cutting-edge MLLM designed specifically for multimodal integration. Despite its advancements, preliminary benchmarks indicate that Gemini lags behind GPT models in commonsense reasoning tasks. However, this assessment is based on a single dataset (HellaSwag) and does not fully capture Gemini's true commonsense reasoning potential. To address this gap, our study undertakes a thorough evaluation of Gemini's performance on complex reasoning tasks that require integrating commonsense knowledge across modalities. We carry out a comprehensive analysis of 12 commonsense reasoning datasets, ranging from general to domain-specific tasks: 11 datasets focused solely on language and one that incorporates multimodal elements. Our experiments across four LLMs and two MLLMs demonstrate Gemini's competitive commonsense reasoning capabilities. Additionally, we identify common challenges faced by current LLMs and MLLMs in addressing commonsense problems, underscoring the need for further advances in the commonsense reasoning abilities of these models.

Related content

Gemini

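On December 6, 2023, Google CEO Sundar Pichai announced the official launch of Gemini 1.0. The Gemini model released at that time is natively multimodal and marks the first step in a new era of Google's large models. It comes in three sizes: Gemini Ultra, the most capable; Gemini Pro, suited to a wide range of tasks; and Gemini Nano, suited to specific tasks and on-device use.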
