Multimodal Large Language Models are primarily trained and evaluated on aligned image-text pairs, which leaves their ability to detect and resolve real-world inconsistencies largely unexplored. In open-domain applications, visual and textual cues often conflict, requiring models to perform structured reasoning beyond surface-level alignment. We introduce CrossCheck-Bench, a diagnostic benchmark for evaluating contradiction detection in multimodal inputs. The benchmark adopts a hierarchical task framework covering three levels of reasoning complexity and defines seven atomic capabilities essential for resolving cross-modal inconsistencies. CrossCheck-Bench comprises 15k question-answer pairs sourced from real-world artifacts with synthetically injected contradictions. The dataset is constructed through a multi-stage annotation pipeline involving more than 450 expert hours, ensuring semantic validity and calibrated difficulty across perception, integration, and reasoning. We evaluate 13 state-of-the-art vision-language models and observe a consistent performance drop as tasks shift from perceptual matching to logical contradiction detection. Most models perform well on isolated entity recognition but fail when multiple clues must be synthesized for conflict reasoning. Capability-level analysis further reveals uneven skill acquisition, especially in tasks requiring multi-step inference or rule-based validation. Additional probing shows that conventional prompting strategies such as Chain-of-Thought and Set-of-Mark yield only marginal gains. By contrast, methods that interleave symbolic reasoning with grounded visual processing achieve more stable improvements. These results highlight a persistent bottleneck in multimodal reasoning and suggest new directions for building models capable of robust cross-modal verification.