Full machine comprehension of scientific papers reflects a high level of Artificial General Intelligence: it requires reasoning across fragmented and heterogeneous sources of information, posing a complex and practically significant challenge. While Vision-Language Models (VLMs) have made remarkable strides in various tasks, particularly those involving reasoning with evidence drawn from a single image or text page, their ability to reason over cross-source information remains an open problem. This work presents MMCR, a high-difficulty benchmark designed to evaluate VLMs' capacity for reasoning with cross-source information from scientific papers. The benchmark comprises 276 high-quality questions, meticulously annotated by humans, spanning 7 subjects and 10 task types. Experiments with 18 VLMs demonstrate that cross-source reasoning poses a substantial challenge for existing models. Notably, even the top-performing model, GPT-4o, achieved only 48.55% overall accuracy, and only 20% accuracy on multi-table comprehension tasks, while the second-best model, Qwen2.5-VL-72B, reached 39.86% overall accuracy. Furthermore, we investigated the impact of the Chain-of-Thought (CoT) technique on cross-source reasoning and observed that it degrades the performance of small models, whereas larger models benefit substantially from it. These results highlight the pressing need to develop VLMs that can effectively exploit cross-source information for reasoning.