Multimodal large language models (MLLMs) have demonstrated promising results in a variety of tasks that combine vision and language. As these models become more integral to research and applications, comprehensive evaluation of their capabilities has grown increasingly important. However, most existing benchmarks fail to consider that, in certain situations, images need to be interpreted within a broader context. In this work, we introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. Our findings indicate that MLLMs consistently fall short of human performance on this benchmark. Further analysis confirms that these models struggle to extract and use contextual information effectively to improve their understanding of images. This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner. Our project website is available at https://thunlp-mt.github.io/CODIS.