We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs) such as ChatGPT and GPT-4. Despite LLMs' prowess in tasks like writing assistance, code generation, and machine translation, assessing their ability to reason has remained challenging. Traditional evaluations often prioritize accuracy on downstream tasks over direct assessments of the reasoning process itself. LogicAsker addresses this gap by employing a set of atomic reasoning skills grounded in propositional and predicate logic to systematically examine and improve the reasoning capabilities of LLMs. Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failure rates ranging from 29\% to 90\% across different models. Moreover, we leverage these findings to construct targeted demonstration examples and fine-tuning data, improving the logical reasoning of models such as GPT-4o by up to 5\%. To our knowledge, this is the first effort to utilize test case outcomes to effectively refine LLMs' formal reasoning capabilities. We make our code, data, and results publicly available (https://github.com/yxwan123/LogicAsker) to facilitate further research and replication of our findings.