RealCQA-V2: A Diagnostic Benchmark for Structured Visual Entailment over Scientific Charts

Multimodal reasoning models often produce fluent answers supported by seemingly coherent rationales. Yet existing benchmarks evaluate only final-answer correctness and do not support atomic visual entailment verification of intermediate steps, particularly for visual compositional logic. This limitation is most acute in scientific chart understanding, where answers depend on deterministically grounded visual semantics such as axes, legends, and quantitative relations. We introduce RealCQA-V2, a large-scale benchmark that reformulates chart question answering as Visual Premise Proving (VPP): a structured logical entailment task over chart-grounded visual predicates. Each question is decomposed into manually curated, atomic premises grounded in chart elements (axes, legends, marks, and quantitative relations), yielding executable reasoning chains rather than free-form textual rationales. Because the premises compose into chains, verification is possible both for individual visual statements and for complete reasoning sequences. We introduce chain-level metrics that measure full logical validity (AccVPP) and partial reasoning progress within failed chains (DCP), extending beyond traditional VQA accuracy. Baseline evaluations across representative LVLMs reveal a consistent local-global reasoning gap: models often verify many individual premises correctly while failing to preserve coherence across the full chain. RealCQA-V2 establishes a reproducible benchmark for structured visual entailment over real scientific charts and enables rigorous diagnosis of multimodal reasoning beyond answer-only evaluation.
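
To make the two chain-level metrics concrete, the following is a minimal sketch under stated assumptions, not the benchmark's official scoring code. The abstract does not give formulas, so the definitions here are illustrative guesses: AccVPP is taken to be the fraction of chains in which every premise is verified correctly, and DCP is taken to be the mean fraction of correctly verified premises among chains that fail. The function names (`acc_vpp`, `dcp`, `premise_accuracy`) and the toy data are hypothetical.

```python
from typing import Sequence

# Each chain is a sequence of booleans: True means the model verified
# that premise correctly. This representation is an assumption made
# for illustration only.
Chain = Sequence[bool]


def _check(chains: Sequence[Chain]) -> None:
    """Reject empty input so the ratios below are well defined."""
    if not chains or any(len(c) == 0 for c in chains):
        raise ValueError("need at least one chain, and no empty chains")


def acc_vpp(chains: Sequence[Chain]) -> float:
    """Assumed AccVPP: share of chains whose premises are all correct."""
    _check(chains)
    return sum(all(c) for c in chains) / len(chains)


def dcp(chains: Sequence[Chain]) -> float:
    """Assumed DCP: mean fraction of correct premises over failed chains."""
    _check(chains)
    failed = [c for c in chains if not all(c)]
    if not failed:
        return 0.0  # no failed chains, so there is no partial progress to report
    return sum(sum(c) / len(c) for c in failed) / len(failed)


def premise_accuracy(chains: Sequence[Chain]) -> float:
    """Per-premise (local) accuracy, pooled over all chains."""
    _check(chains)
    total = sum(len(c) for c in chains)
    return sum(sum(c) for c in chains) / total


# Toy data (made up): one fully valid chain and two failed chains.
example_chains = [
    [True, True, True],
    [True, False, True, True],
    [False, False],
]

if __name__ == "__main__":
    print("premise accuracy:", premise_accuracy(example_chains))
    print("AccVPP:", acc_vpp(example_chains))
    print("DCP:", dcp(example_chains))
```

On this toy data, six of nine premises are verified correctly while only one of three chains is fully valid, which is the kind of local-global gap the abstract describes.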
