Recent advances in multimodal methods have opened an exciting era of models adept at processing diverse data types, including text, audio, and visual content. Models such as GPT-4V, which combine computer vision with advanced language processing, show remarkable proficiency on intricate tasks that require a simultaneous understanding of textual and visual information. Prior work has evaluated these Vision Large Language Models (VLLMs) across domains such as object detection, image captioning, and related fields. However, existing analyses share a limitation: they evaluate each modality's performance in isolation and neglect the models' cross-modal interactions. In particular, whether these models reach the same accuracy when confronted with identical task instances presented in different modalities remains an open question. In this study, we examine the interaction and comparison among these modalities by introducing the concept of cross-modal consistency, and we propose a quantitative evaluation framework built on it. Experiments on a curated collection of parallel vision-language datasets that we developed reveal a pronounced inconsistency between the vision and language modalities within GPT-4V, despite its portrayal as a unified multimodal model. Our findings inform the appropriate use of such models and suggest directions for improving their design.
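One way to make the notion of cross-modal consistency concrete is as an agreement rate over parallel instances, each rendered once as text and once as an image. The sketch below is illustrative only: the function names, the stubbed model calls, and the exact-match agreement criterion are assumptions for exposition, not the paper's actual framework.

```python
from typing import Callable, Sequence


def cross_modal_consistency(
    instances: Sequence[dict],
    answer_text: Callable[[str], str],
    answer_vision: Callable[[str], str],
) -> float:
    """Fraction of parallel instances on which the two modalities agree."""
    agreements = 0
    for inst in instances:
        a_text = answer_text(inst["text"])      # answer to the text rendering
        a_vision = answer_vision(inst["image"])  # answer to the image rendering
        agreements += (a_text == a_vision)
    return agreements / len(instances)


# Toy illustration with stubbed "model" lookups in place of real API calls:
parallel = [
    {"text": "2+3=?", "image": "img_0.png"},
    {"text": "5-1=?", "image": "img_1.png"},
]
stub_text = {"2+3=?": "5", "5-1=?": "4"}
stub_vision = {"img_0.png": "5", "img_1.png": "3"}  # errs on the image version
rate = cross_modal_consistency(parallel, stub_text.get, stub_vision.get)
print(rate)  # → 0.5
```

In practice the two answer functions would query the same multimodal model through its text and image interfaces, and a normalization step (rather than strict string equality) would decide agreement.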