Full machine comprehension of scientific papers reflects a high level of Artificial General Intelligence: it requires reasoning across fragmented and heterogeneous sources of information, posing a complex and practically significant challenge. While Vision-Language Models (VLMs) have made remarkable strides in various tasks, particularly those involving reasoning with evidence drawn from a single image or text page, their ability to reason over cross-source information remains an open problem. This work presents MMCR, a high-difficulty benchmark designed to evaluate VLMs' capacity for reasoning with cross-source information from scientific papers. The benchmark comprises 276 high-quality questions, meticulously annotated by humans, spanning 7 subjects and 10 task types. Experiments with 18 VLMs demonstrate that cross-source reasoning poses a substantial challenge for existing models. Notably, even the top-performing model, GPT-4o, achieved only 48.55% overall accuracy, and only 20% accuracy on multi-table comprehension tasks, while the second-best model, Qwen2.5-VL-72B, reached 39.86% overall accuracy. Furthermore, we investigated the impact of the Chain-of-Thought (CoT) technique on cross-source reasoning and observed that it degrades the performance of small models, whereas larger models benefit substantially from it. These results highlight the pressing need to develop VLMs that can effectively exploit cross-source information for reasoning.