We introduce CRYSTAL (Clear Reasoning via Yielded Steps, Traceability, and Logic), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: Match F1, which scores step-level precision and recall via semantic similarity matching, and Ordered Match F1, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline: four independent MLLMs generate reasoning trajectories, which are then aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures that are invisible to answer accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning in which no competitive model preserves more than 60% of matched steps in the correct order. Beyond evaluation, we propose the Causal Process Reward (CPR), a multiplicative reward that couples answer correctness with step-level alignment, and CPR-Curriculum, which progressively increases reasoning difficulty during training. Under GRPO, CPR-Curriculum achieves a 32% improvement in Match F1 where additive reward strategies fail, improving reasoning without manual step annotation.
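To make the two metrics concrete, the following is a minimal illustrative sketch, not the paper's implementation: it matches predicted steps to reference steps one-to-one above a similarity threshold (Match F1), then keeps only the longest in-order subsequence of matches (Ordered Match F1). The `similarity` function and the `0.7` threshold are placeholder assumptions; the paper uses semantic similarity, whereas string overlap stands in for it here.

```python
import bisect
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Placeholder for an embedding-based semantic similarity score.
    return SequenceMatcher(None, a, b).ratio()

def match_f1(pred, ref, thresh=0.7):
    """Greedy one-to-one matching of predicted to reference steps."""
    matched = []          # (pred_idx, ref_idx) pairs, in pred order
    used = set()          # reference steps already consumed
    for i, p in enumerate(pred):
        best, best_j = 0.0, None
        for j, r in enumerate(ref):
            if j not in used:
                s = similarity(p, r)
                if s > best:
                    best, best_j = s, j
        if best_j is not None and best >= thresh:
            matched.append((i, best_j))
            used.add(best_j)
    precision = len(matched) / len(pred) if pred else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    denom = precision + recall
    return (2 * precision * recall / denom if denom else 0.0), matched

def ordered_match_f1(pred, ref, thresh=0.7):
    """Credit only matches whose reference indices appear in increasing
    order, via the longest increasing subsequence; disordered chains
    therefore score strictly lower than Match F1."""
    _, matched = match_f1(pred, ref, thresh)
    ref_order = [j for _, j in matched]
    tails = []  # standard O(n log n) LIS over reference indices
    for x in ref_order:
        k = bisect.bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    kept = len(tails)
    precision = kept / len(pred) if pred else 0.0
    recall = kept / len(ref) if ref else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

On identical step lists both metrics give 1.0; reversing the predicted steps leaves Match F1 at 1.0 but drops Ordered Match F1, which is the behavior the abstract describes for disordered reasoning chains.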