Large language models perform well on many logical reasoning benchmarks, but it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a benchmark that isolates three fundamental logical skills: (i) $\textit{formal symbolization}$: translating premises into first-order logic; (ii) $\textit{countermodel construction}$: showing that an argument is logically invalid by constructing a finite countermodel; and (iii) $\textit{validity assessment}$: determining whether a conclusion follows from a set of premises. Items are drawn from the two-variable fragment of first-order logic without identity and are presented in both English and a Carrollian nonce-word language. All instances are solver-verified with Z3 for correctness and non-triviality. Across conventional instruction-tuned LLMs, performance is high on $\textit{validity assessment}$ but substantially lower on $\textit{formal symbolization}$ and $\textit{countermodel construction}$, highlighting that high task-level accuracy can mask weaknesses in core logical skills. In contrast, recent reasoning-tuned models perform strongly across all three tasks, suggesting a more systematic logical skill profile.
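To make the countermodel-construction task concrete, here is a minimal sketch (not the paper's Z3-based verification pipeline) of how a finite countermodel can witness invalidity. It brute-forces interpretations of two hypothetical monadic predicates, A and B, over small domains, looking for a model where the premise "some A are B" holds but the conclusion "all A are B" fails; the predicate names and the `find_countermodel` helper are illustrative assumptions, not artifacts from the benchmark.

```python
from itertools import product

def interpretations(domain):
    """Enumerate every assignment of truth values for the illustrative
    monadic predicates A and B over the given finite domain."""
    n = len(domain)
    for bits in product([False, True], repeat=2 * n):
        A = dict(zip(domain, bits[:n]))
        B = dict(zip(domain, bits[n:]))
        yield A, B

def find_countermodel(max_size=3):
    """Search for a finite countermodel to the (invalid) argument:
    premise 'some A are B', conclusion 'all A are B'."""
    for n in range(1, max_size + 1):
        domain = list(range(n))
        for A, B in interpretations(domain):
            premise = any(A[d] and B[d] for d in domain)
            conclusion = all(B[d] for d in domain if A[d])
            if premise and not conclusion:
                # Premise true, conclusion false: a countermodel.
                return domain, A, B
    return None

model = find_countermodel()
assert model is not None  # the argument is invalid, so a countermodel exists
```

The smallest such countermodel needs two elements: one that is both A and B (making the premise true) and one that is A but not B (falsifying the conclusion). In the benchmark itself this role is played by Z3, which additionally certifies that each item is non-trivial.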