Evaluating whether multimodal large language models (MLLMs) truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace through the document. We propose the "Fish-in-the-Ocean" (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce a "No Evidence, No Score" protocol, which scores a prediction only when it is grounded to verifiable anchors and diagnoses evidence quality along matching, relevance, and logic dimensions. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) yet underperforms on evidence-aligned overall scores, exposing a gap between answer correctness and traceable support.
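To make the evidence-gated scoring concrete, here is a minimal sketch of how a "No Evidence, No Score" rule might be implemented. Everything in it is a hypothetical illustration under assumed conventions: the `Prediction` structure, the `evaluate` function, the equal weighting of answer correctness against evidence quality, and the three sub-scores are not drawn from the paper's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch of "No Evidence, No Score" gating.
# Names, fields, and weights are illustrative assumptions,
# not the benchmark's actual scoring code.

@dataclass
class Prediction:
    answer_score: float      # answer-matching score in [0, 1]
    cited_anchors: set[str]  # evidence anchors the model cites
    matching: float          # do cited anchors match the gold evidence?
    relevance: float         # are cited anchors relevant to the question?
    logic: float             # does the reasoning chain follow from them?

def evaluate(pred: Prediction, verifiable_anchors: set[str]) -> float:
    """Score a prediction only when its evidence is verifiable."""
    # Gate: if the model cites no anchors, or any cited anchor cannot be
    # verified in the document, the score is zero regardless of the answer.
    if not pred.cited_anchors or not pred.cited_anchors <= verifiable_anchors:
        return 0.0
    # Diagnose evidence quality along matching / relevance / logic,
    # then combine with answer correctness (equal weights assumed here).
    evidence_quality = (pred.matching + pred.relevance + pred.logic) / 3
    return 0.5 * pred.answer_score + 0.5 * evidence_quality
```

Under this kind of gating, a model like GPT-5 in the reported results can score highly on answer accuracy yet poorly overall, because correct answers backed by missing or unverifiable anchors contribute nothing.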