Large language models have demonstrated notable performance across various logical reasoning benchmarks. However, it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a unified benchmark designed to isolate three fundamental skills in formal reasoning: (i) \textit{formal symbolization}: translating premises into first-order logic; (ii) \textit{countermodel construction}: constructing a finite structure in which all premises are true while the conclusion is false; and (iii) \textit{validity assessment}: deciding whether a conclusion follows from a given set of premises. Items are drawn from the two-variable fragment of first-order logic (without identity) and are presented both in natural English and in a Carroll-style language with nonce words. All examples are verified for correctness and non-triviality with the SMT solver Z3. Across leading models, performance is high on validity assessment but substantially lower on symbolization and countermodel construction, suggesting reliance on surface-level patterns rather than genuine symbolic or rule-based reasoning.
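The Z3-based verification described above can be illustrated with a minimal sketch: an argument is valid exactly when the premises conjoined with the negated conclusion are unsatisfiable, and any satisfying model Z3 returns is a finite countermodel. The snippet below uses the z3-solver Python API; the nonce predicates (Borogove, Mimsy, Slithy) and the particular argument are illustrative assumptions, not items drawn from the benchmark itself.

```python
from z3 import (Solver, DeclareSort, Const, Function, BoolSort,
                ForAll, Exists, Implies, And, Not, unsat)

# Uninterpreted domain sort over which the finite structure is built.
D = DeclareSort('D')
x = Const('x', D)

# Hypothetical unary predicates standing in for Carroll-style nonce categories.
Borogove = Function('Borogove', D, BoolSort())
Mimsy = Function('Mimsy', D, BoolSort())
Slithy = Function('Slithy', D, BoolSort())

premises = [
    ForAll(x, Implies(Borogove(x), Mimsy(x))),   # All borogoves are mimsy.
    Exists(x, And(Slithy(x), Borogove(x))),      # Some slithy thing is a borogove.
]
conclusion = ForAll(x, Implies(Slithy(x), Mimsy(x)))  # All slithy things are mimsy.

# Validity check: premises AND NOT(conclusion) is unsatisfiable iff the argument is valid.
s = Solver()
s.add(*premises)
s.add(Not(conclusion))

if s.check() == unsat:
    print("valid: no countermodel exists")
else:
    # The model is a finite structure in which every premise holds
    # while the conclusion fails, i.e., a countermodel.
    print("invalid: countermodel found")
    print(s.model())
```

For this example the conclusion does not follow (only "some slithy thing is mimsy" is entailed), so Z3 reports a countermodel containing a slithy element that is neither a borogove nor mimsy, exercising both the validity-assessment and countermodel-construction checks in one call.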