Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which cannot assess robustness under logically equivalent transformations and often overestimate reasoning capability. We propose LGMT (Logic-Grounded Metamorphic Testing), an oracle-free framework that uses first-order logic (FOL) to evaluate LLM reasoning. By deriving metamorphic relations from formal logical equivalences, LGMT constructs semantically invariant test cases and detects reasoning defects through cross-case consistency checking. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects missed by traditional reference-based evaluations. We further find that models are particularly sensitive to symbol-level and conclusion-level variations, and that advanced prompting strategies such as few-shot chain-of-thought (CoT) only partially mitigate these issues. These results suggest that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance. LGMT provides a principled and scalable approach for diagnosing reasoning failures.
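To make the idea of oracle-free consistency checking concrete, the following is a minimal sketch and not the LGMT implementation. It assumes two illustrative metamorphic relations (consistent renaming of predicate symbols and reordering of premises), both of which leave entailment unchanged. All names here (`Case`, `rename_symbols`, `reorder_premises`, `find_violations`, and the two toy models) are hypothetical; a real setup would replace the toy models with calls to an LLM.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Case:
    """A reasoning problem: FOL premises plus a conclusion to judge."""
    premises: Tuple[str, ...]
    conclusion: str


def rename_symbols(case: Case, mapping: Optional[Dict[str, str]] = None) -> Case:
    """Symbol-level relation (assumed): rename predicates consistently."""
    mapping = mapping or {"Human": "P", "Mortal": "Q"}
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")

    def substitute(formula: str) -> str:
        # Replace every mapped predicate name in one pass.
        return pattern.sub(lambda m: mapping[m.group(1)], formula)

    return Case(tuple(substitute(p) for p in case.premises),
                substitute(case.conclusion))


def reorder_premises(case: Case) -> Case:
    """Premise order does not affect what a set of premises entails."""
    return Case(tuple(reversed(case.premises)), case.conclusion)


# Hypothetical registry of semantics-preserving transformations.
RELATIONS: Dict[str, Callable[[Case], Case]] = {
    "rename_symbols": rename_symbols,
    "reorder_premises": reorder_premises,
}


def find_violations(model: Callable[[Case], str], source: Case,
                    relations: Dict[str, Callable[[Case], Case]]) -> List[str]:
    """Return the relations whose follow-up answer differs from the source.

    No ground-truth label is needed: any disagreement between logically
    equivalent cases signals a reasoning defect.
    """
    baseline = model(source)
    return [name for name, transform in relations.items()
            if model(transform(source)) != baseline]


def brittle_model(case: Case) -> str:
    """Toy stand-in for an LLM that keys on a familiar predicate name."""
    return "True" if "Mortal" in case.conclusion else "Unknown"


def stable_model(case: Case) -> str:
    """Toy stand-in for an LLM whose answer ignores surface form."""
    return "True"


source = Case(
    premises=("forall x. (Human(x) -> Mortal(x))", "Human(socrates)"),
    conclusion="Mortal(socrates)",
)

print(find_violations(brittle_model, source, RELATIONS))  # ['rename_symbols']
print(find_violations(stable_model, source, RELATIONS))   # []
```

The check compares a model's answers with each other and never with a reference label, which is why a model can pass a static benchmark item and still be flagged here.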