Logical reasoning is a fundamental task in natural language processing that presents significant challenges to Large Language Models (LLMs). The inherent characteristics of logical reasoning make it well suited to symbolic representations such as first-order logic (FOL). Research in symbolic logical reasoning has explored FOL generation using state-of-the-art LLMs (e.g., GPT-4) to produce FOL translations of natural language (NL) statements, but errors in translation are usually not the focus. We address this by categorizing the translation errors in FOL statements generated by LLMs. To make progress towards improving the quality of FOL translations produced by smaller language models such as LLaMA-2 13B and Mistral 7B, we create ProofFOL, a high-quality FOL-annotated subset of the ProofWriter dataset, using GPT-4o. Models fine-tuned on this silver-standard data achieve a significant gain in performance over larger language models such as LLaMA-2 70B. In addition to improving the model with large-scale data, we also tackle the issue of data scarcity by introducing an incremental framework comprising data augmentation and verification steps. In the augmentation process, a single (premises, conclusion) pair is split into multiple new instances based on its predicates and FOLs. This data is used for fine-tuning, and the resulting model generates FOLs with fewer errors than a model trained on the original data. Our investigation of the translation errors leads to the generation of a perturbation dataset, which is used to train a verifier that corrects potential syntactic and semantic FOL translation errors. We demonstrate an efficient method for making the most of a limited existing human-annotated dataset. Our results show state-of-the-art performance on the ProofWriter and ProntoQA datasets using ProofFOL with LLaMA-2 and Mistral models.
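The incremental augmentation step can be sketched as follows. This is a minimal illustration only: the record layout, field names, and the prefix-splitting strategy are assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch of splitting one (premises, conclusion) record into
# multiple incremental training instances based on its predicates and FOLs.
# The record structure below is a hypothetical example, not the paper's format.

def augment(record):
    """Split a single (premises, conclusion) pair into incremental instances.

    Assumes `record` holds parallel lists of NL premises and their FOL
    translations, plus a shared predicate inventory.
    """
    instances = []
    # One instance per prefix of the premise list (assumed splitting scheme).
    for k in range(1, len(record["premises"]) + 1):
        instances.append({
            "predicates": record["predicates"],   # shared predicate list
            "premises": record["premises"][:k],   # first k NL premises
            "fols": record["fols"][:k],           # their FOL translations
        })
    # A final instance appends the conclusion and its FOL.
    instances.append({
        "predicates": record["predicates"],
        "premises": record["premises"] + [record["conclusion"]],
        "fols": record["fols"] + [record["conclusion_fol"]],
    })
    return instances

example = {
    "predicates": ["Cat(x)", "Furry(x)"],
    "premises": ["Tom is a cat.", "All cats are furry."],
    "fols": ["Cat(Tom)", "∀x (Cat(x) → Furry(x))"],
    "conclusion": "Tom is furry.",
    "conclusion_fol": "Furry(Tom)",
}
print(len(augment(example)))  # one instance per premise prefix, plus one with the conclusion
```

Each new instance is a self-contained (predicates, premises, FOLs) triple, so a single annotated record yields several fine-tuning examples.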