Large language models (LLMs) have achieved impressive human-like performance across various reasoning tasks. However, their mastery of underlying inferential rules still falls short of human capabilities. To investigate this, we propose a logic scaffolding inferential rule generation framework to construct an inferential rule base, ULogic, comprising both primitive and compositional rules across five domains. Our analysis of GPT-series models on a rule subset reveals significant gaps in LLMs' logic understanding compared to human performance, especially on compositional and structurally complex rules, with certain bias patterns. We further distill these rules into a smaller-scale inference engine for flexible rule generation and enhanced downstream reasoning. Through a multi-judger evaluation, our inference engine proves effective in generating accurate, complex, and abstract conclusions and premises, and in improving various commonsense reasoning tasks. Overall, our work sheds light on LLMs' limitations in grasping inferential rules and suggests ways to enhance their logical reasoning abilities~\footnote{Code and data are available at \url{https://github.com/SiyuanWangw/ULogic}.}.