Logical reasoning, i.e., deductively inferring the truth value of a conclusion from a set of premises, is an important task for artificial intelligence with wide potential impacts on science, mathematics, and society. While many prompting-based strategies have been proposed to enable Large Language Models (LLMs) to do such reasoning more effectively, they still appear unsatisfactory, often failing in subtle and unpredictable ways. In this work, we investigate the validity of instead reformulating such tasks as modular neurosymbolic programming, which we call LINC: Logical Inference via Neurosymbolic Computation. In LINC, the LLM acts as a semantic parser, translating premises and conclusions from natural language to expressions in first-order logic. These expressions are then offloaded to an external theorem prover, which symbolically performs deductive inference. Leveraging this approach, we observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate. On ProofWriter, augmenting the comparatively small open-source StarCoder+ (15.5B parameters) with LINC even outperforms GPT-3.5 and GPT-4 with Chain-of-Thought (CoT) prompting by an absolute 38% and 10%, respectively. When used with GPT-4, LINC scores 26% higher than CoT on ProofWriter while performing comparatively on FOLIO. Further analysis reveals that although both methods on average succeed roughly equally often on this dataset, they exhibit distinct and complementary failure modes. We thus provide promising evidence for how logical reasoning over natural language can be tackled through jointly leveraging LLMs alongside symbolic provers. All corresponding code is publicly available at https://github.com/benlipkin/linc
翻译:摘要:逻辑推理,即从一组前提中演绎推断结论的真值,是人工智能的一项重要任务,对科学、数学和社会具有广泛的潜在影响。尽管已有许多基于提示的策略被提出,旨在使大语言模型更有效地进行此类推理,但这些方法仍不尽如人意,常常以微妙且不可预测的方式失败。本研究探讨了一种替代方案的有效性,即将此类任务重新表述为模块化神经符号编程,我们称之为LINC:基于神经符号计算的逻辑推理。在LINC中,语言模型充当语义解析器,将前提和结论从自然语言翻译为一阶逻辑表达式。随后,这些表达式被外包给外部定理证明器,由后者以符号方式执行演绎推理。采用这种方法,我们在FOLIO数据集和ProofWriter的平衡子集上观察到,三种不同模型在几乎所有评估的实验条件下均取得了显著的性能提升。在ProofWriter上,将相对较小的开源模型StarCoder+(155亿参数)与LINC结合,甚至比使用思维链提示的GPT-3.5和GPT-4分别高出38%和10%的绝对性能。当与GPT-4结合使用时,LINC在ProofWriter上的得分比思维链高出26%,同时在FOLIO上表现相当。进一步分析揭示,尽管两种方法在此数据集上平均成功次数大致相当,但它们的失败模式截然不同且具有互补性。因此,我们为如何通过联合利用语言模型与符号证明器来应对自然语言的逻辑推理提供了有力的证据。所有相关代码已公开于https://github.com/benlipkin/linc。