Large language models (LLMs) have achieved impressive human-like performance across various reasoning tasks. However, their mastery of underlying inferential rules still falls short of human capabilities. To investigate this, we propose a logic scaffolding inferential rule generation framework to construct an inferential rule base, ULogic, comprising both primitive and compositional rules across five domains. Our analysis of GPT-series models on a rule subset reveals significant gaps in LLMs' logic understanding compared to human performance, especially on compositional and structurally complex rules, with certain bias patterns. We further distill these rules into a smaller-scale inference engine for flexible rule generation and enhanced downstream reasoning. Through a multi-judger evaluation, our inference engine proves effective in generating accurate, complex, and abstract conclusions and premises, and in improving various commonsense reasoning tasks. Overall, our work sheds light on LLMs' limitations in grasping inferential rules and suggests ways to enhance their logical reasoning abilities~\footnote{Code and data are available at \url{https://github.com/SiyuanWangw/ULogic}.}.