MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified compositional visual conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition that is grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To construct such workflow-style data at scale, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures that each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, and that performance drops sharply on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
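To make the notion of a multi-layer conditional chain concrete, the following is a minimal illustrative sketch, not the benchmark's actual data format or the VPIR itself. It assumes that visual evidence has already been reduced to a dictionary of facts, and the names `Layer`, `run_chain`, `Facts`, and every fact key are hypothetical. Each layer holds a compositional predicate; the chain follows the true branch to the next layer and terminates early on the first false condition.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-in for visual evidence extracted from an image or screenshot.
Facts = Dict[str, Any]


@dataclass
class Layer:
    name: str
    condition: Callable[[Facts], bool]  # compositional predicate over several facts
    on_false: str                       # outcome if this layer's condition fails


def run_chain(layers: List[Layer], facts: Facts, final_outcome: str) -> Tuple[List[bool], str]:
    """Evaluate layers in order; stop at the first false condition."""
    path: List[bool] = []
    for layer in layers:
        holds = bool(layer.condition(facts))
        path.append(holds)
        if not holds:
            return path, layer.on_false  # early termination
    return path, final_outcome


# Two-layer chain modeled on the abstract's GUI example (fact keys are assumed).
layers = [
    Layer(
        "permission_dialog",
        lambda f: f["dialog_visible"] and f["interface_color"] == "green",
        on_false="dismiss",
    ),
    Layer("allow_button", lambda f: f["allow_enabled"], on_false="wait"),
]

facts = {"dialog_visible": True, "interface_color": "green", "allow_enabled": True}
path, outcome = run_chain(layers, facts, final_outcome="click Allow")

# A small perturbation of one fact changes the execution path.
perturbed = dict(facts, interface_color="blue")
neg_path, neg_outcome = run_chain(layers, perturbed, final_outcome="click Allow")
```

In this sketch, changing a single attribute in the first layer ends the chain after one step, which is one plausible reading of why near-miss conditions are hard: the model has to verify every conjunct at every layer to land on the right outcome.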
