Multimodal large language models (MLLMs) have demonstrated promising results in a variety of tasks that combine vision and language. As these models become more integral to research and applications, comprehensive evaluation of their capabilities has grown increasingly important. However, most existing benchmarks fail to consider that, in certain situations, images need to be interpreted within a broader context. In this work, we introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. Our findings indicate that MLLMs consistently fall short of human performance on this benchmark. Further analysis confirms that these models struggle to effectively extract and utilize contextual information to improve their understanding of images. This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner. The project website is available at https://thunlp-mt.github.io/CODIS.