Multimodal Large Language Models (MLLMs), such as OpenAI's GPT-4V(ision), have attracted growing interest and significantly influenced both academia and industry. These models extend Large Language Models (LLMs) with advanced visual understanding, enabling their use in a variety of multimodal tasks. Recently, Google introduced Gemini, a cutting-edge MLLM designed specifically for multimodal integration. Despite its advancements, preliminary benchmarks indicate that Gemini lags behind GPT models on commonsense reasoning tasks. However, this assessment rests on a limited dataset (i.e., HellaSwag) and does not fully capture Gemini's actual commonsense reasoning potential. To address this gap, our study thoroughly evaluates Gemini's performance on complex reasoning tasks that require integrating commonsense knowledge across modalities. We carry out a comprehensive analysis of 12 commonsense reasoning datasets, ranging from general to domain-specific tasks: 11 are language-only and one incorporates multimodal elements. Our experiments across four LLMs and two MLLMs demonstrate that Gemini's commonsense reasoning capabilities are competitive. We also identify common challenges that current LLMs and MLLMs face in solving commonsense problems, underscoring the need for further advances in the commonsense reasoning abilities of these models.