We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs) such as ChatGPT and GPT-4. Despite LLMs' prowess in tasks like writing assistance, code generation, and machine translation, assessing their ability to reason has remained challenging. Traditional evaluations often prioritize accuracy on downstream tasks over direct assessments of the reasoning process itself. LogicAsker addresses this gap by employing a set of atomic reasoning skills grounded in propositional and predicate logic to systematically examine and improve the reasoning capabilities of LLMs. Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failure rates ranging from 29\% to 90\% across different models. Moreover, we leverage these findings to construct targeted demonstration examples and fine-tuning data, improving the logical reasoning of models such as GPT-4o by up to 5\%. To our knowledge, this is the first effort to utilize test case outcomes to effectively refine LLMs' formal reasoning capabilities. We make our code, data, and results publicly available (https://github.com/yxwan123/LogicAsker) to facilitate further research and replication of our findings.