We introduce **CRYSTAL** (*__C__lear __R__easoning via __Y__ielded __S__teps, __T__raceability __a__nd __L__ogic*), a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps. We propose two complementary metrics: *Match F1*, which scores step-level precision and recall via semantic-similarity matching, and *Ordered Match F1*, which further penalizes disordered reasoning chains. References are constructed through a Delphi-inspired pipeline in which four independent MLLMs generate trajectories that are aggregated via semantic clustering and validated through human quality gates. Evaluation of 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures invisible to accuracy: universal cherry-picking (precision far exceeds recall), non-monotonic scaling trade-offs, and disordered reasoning, where no competitive model preserves more than 60% of matched steps in the correct order. Beyond evaluation, we propose the **Causal Process Reward (CPR)**, a multiplicative reward that couples answer correctness with step-level alignment, and **CPR-Curriculum**, which progressively increases reasoning difficulty during training. Under GRPO, CPR-Curriculum achieves a +32% Match F1 gain where additive reward strategies fail, improving reasoning without manual step annotation.
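To make the metrics concrete, the sketch below illustrates one plausible reading of *Match F1*, *Ordered Match F1*, and the multiplicative CPR coupling. It is not the paper's implementation: `difflib.SequenceMatcher` stands in for the embedding-based semantic matcher, and the greedy one-to-one matching, the 0.7 similarity threshold, and the exact form of `cpr_reward` are all assumptions made for illustration.

```python
import bisect
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in for semantic similarity; the actual matcher would use
    # sentence embeddings rather than string overlap (assumption).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_f1(pred_steps, ref_steps, threshold=0.7):
    """Greedily match each predicted step to at most one reference step,
    then score step-level precision and recall as an F1."""
    matches = []  # (pred_index, ref_index) pairs, in prediction order
    used_refs = set()
    for i, p in enumerate(pred_steps):
        best_j, best_s = None, threshold
        for j, r in enumerate(ref_steps):
            if j in used_refs:
                continue
            s = similarity(p, r)
            if s >= best_s:
                best_j, best_s = j, s
        if best_j is not None:
            matches.append((i, best_j))
            used_refs.add(best_j)
    precision = len(matches) / len(pred_steps) if pred_steps else 0.0
    recall = len(matches) / len(ref_steps) if ref_steps else 0.0
    denom = precision + recall
    return (2 * precision * recall / denom if denom else 0.0), matches


def ordered_match_f1(pred_steps, ref_steps, threshold=0.7):
    """Credit only matches whose reference indices form an increasing
    subsequence (longest increasing subsequence), so a correct but
    disordered chain scores lower than an ordered one."""
    _, matches = match_f1(pred_steps, ref_steps, threshold)
    ref_order = [j for _, j in matches]
    lis = []  # patience-sorting LIS over reference indices
    for j in ref_order:
        k = bisect.bisect_left(lis, j)
        if k == len(lis):
            lis.append(j)
        else:
            lis[k] = j
    kept = len(lis)
    precision = kept / len(pred_steps) if pred_steps else 0.0
    recall = kept / len(ref_steps) if ref_steps else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def cpr_reward(answer_correct: bool, step_f1: float) -> float:
    # Multiplicative coupling (illustrative): zero reward unless the
    # answer is correct, scaled by step-level alignment.
    return float(answer_correct) * step_f1
```

With identical prediction and reference chains both metrics reach 1.0, while reversing the prediction order leaves Match F1 unchanged but collapses Ordered Match F1, which is the disordering penalty the second metric exists to capture.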