Large language models perform well on many logical reasoning benchmarks, but it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a benchmark that isolates three fundamental logical skills: (i) $\textit{formal symbolization}\unicode{x2014}{}$translating premises into first-order logic; (ii) $\textit{countermodel construction}\unicode{x2014}$showing that an argument is logically invalid by constructing a finite countermodel; and (iii) $\textit{validity assessment}\unicode{x2014}$determining whether a conclusion follows from a set of premises. Items are drawn from the two-variable fragment of first-order logic without identity and are presented in both English and a Carrollian nonce-word language. All instances are solver-verified with Z3 for correctness and non-triviality. Across conventional instruction-tuned LLMs, performance is high on $\textit{validity assessment}$ but substantially lower on $\textit{formal symbolization}$ and $\textit{countermodel construction}$, highlighting that high task-level accuracy can mask weaknesses in core logical skills. In contrast, recent reasoning-tuned models perform strongly across all three tasks, suggesting a more systematic logical skill profile.
翻译:暂无翻译