Recent advances in AI have produced large multimodal models (LMMs) capable of complex tasks that require joint reasoning over the text and visual content of an image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark of instructions designed explicitly to evaluate LMMs' ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, and shopping) that demand a deeper understanding of the interactions between textual and visual elements. In human evaluation, we find a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human performance, indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories such as meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluation, we employed automatic evaluation using GPT-4, which uncovered similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide a qualitative analysis, which together offer a robust framework for future advances in LMM design. https://con-textual.github.io/