Current evaluation of mathematical reasoning in language models relies primarily on answer accuracy, potentially masking fundamental failures in logical computation. We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching through four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. Through a case study applying this framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity. While the model achieves reasonable answer accuracy (above 70%), it demonstrates poor backward consistency (15%), limited transitivity coverage (32.2%), and pronounced brittleness under perturbation. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics, suggesting that this small model relies heavily on pattern matching rather than genuine logical computation. While our empirical findings are based on a single 600M-parameter model, the diagnostic framework itself is model-agnostic and generalizable. We release our evaluation protocols to enable the research community to assess reasoning fidelity across different model scales and architectures, moving beyond surface-level accuracy toward verifiable mathematical reasoning.
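To make the four axes concrete, the following is a minimal sketch of how per-item judgments could be aggregated into the reported scores. All names here (`DiagnosticRecord`, `diagnostic_report`) and the choice of boolean per-item flags are illustrative assumptions, not the released evaluation protocol.

```python
"""Illustrative aggregation of the four diagnostic axes alongside answer accuracy.

Assumes each evaluation item has already been judged along every axis;
how those judgments are produced (inverted questions, counterfactual edits,
paraphrases) is left to the actual protocol.
"""
from dataclasses import dataclass
from typing import Iterable


@dataclass
class DiagnosticRecord:
    correct: bool               # forward answer matches the gold answer
    backward_consistent: bool   # model recovers the premise when the question is inverted
    transitive_closed: bool     # implied A->C relation answered consistently with A->B and B->C
    counterfactual_shift: bool  # answer changes appropriately when a premise is edited
    perturbation_stable: bool   # answer survives a meaning-preserving rewording


def rate(flags: Iterable[bool]) -> float:
    """Fraction of items for which a flag holds."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def diagnostic_report(records: list[DiagnosticRecord]) -> dict[str, float]:
    """Aggregate per-item flags into accuracy plus the four fidelity scores."""
    return {
        "answer_accuracy": rate(r.correct for r in records),
        "backward_consistency": rate(r.backward_consistent for r in records),
        "transitivity_coverage": rate(r.transitive_closed for r in records),
        "counterfactual_sensitivity": rate(r.counterfactual_shift for r in records),
        "perturbation_robustness": rate(r.perturbation_stable for r in records),
    }
```

Under this framing, the disconnect described above would appear as a high `answer_accuracy` coexisting with low values on the other four keys.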