Logical reasoning is a fundamental task in natural language processing that presents significant challenges to Large Language Models (LLMs). The inherent characteristics of logical reasoning make it well suited to symbolic representations such as first-order logic (FOL). Research in symbolic logical reasoning has explored FOL generation using state-of-the-art LLMs (e.g., GPT-4) to produce FOL translations of natural language (NL) statements, but errors in translation are usually not the focus. We address this by categorizing the translation errors in FOL statements generated by LLMs. To make progress towards improving the quality of FOL translations produced by smaller language models such as LLaMA-2 13B and Mistral 7B, we create ProofFOL, a high-quality FOL-annotated subset of the ProofWriter dataset, using GPT-4o. Models fine-tuned on this silver-standard data achieve a significant gain in performance over larger language models such as LLaMA-2 70B. In addition to improving the model with large-scale data, we also tackle the issue of data scarcity by introducing an incremental framework comprising data augmentation and verification steps. In the augmentation process, a single (premises, conclusion) pair is split into multiple new instances based on its predicates and FOLs. This data is used for fine-tuning, and the resulting model generates FOLs with fewer errors than a model trained on the original data. Our investigation of the translation errors leads to the generation of a perturbation dataset, which is used to train a verifier that corrects potential syntactic and semantic FOL translation errors. We demonstrate an efficient method for making the most of a limited existing human-annotated dataset. Our results show state-of-the-art performance on the ProofWriter and ProntoQA datasets using ProofFOL with LLaMA-2 and Mistral models.
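The incremental augmentation step can be sketched as follows. This is a minimal illustration only: the record layout, field names, and the prefix-splitting strategy are assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch of splitting one (premises, conclusion) record into
# multiple incremental training instances based on its predicates and FOLs.
# The record structure below is a hypothetical example, not the paper's format.

def augment(record):
    """Split a single (premises, conclusion) pair into incremental instances.

    Assumes `record` holds parallel lists of NL premises and their FOL
    translations, plus a shared predicate inventory.
    """
    instances = []
    # One instance per prefix of the premise list (assumed splitting scheme).
    for k in range(1, len(record["premises"]) + 1):
        instances.append({
            "predicates": record["predicates"],   # shared predicate list
            "premises": record["premises"][:k],   # first k NL premises
            "fols": record["fols"][:k],           # their FOL translations
        })
    # A final instance appends the conclusion and its FOL.
    instances.append({
        "predicates": record["predicates"],
        "premises": record["premises"] + [record["conclusion"]],
        "fols": record["fols"] + [record["conclusion_fol"]],
    })
    return instances

example = {
    "predicates": ["Cat(x)", "Furry(x)"],
    "premises": ["Tom is a cat.", "All cats are furry."],
    "fols": ["Cat(Tom)", "∀x (Cat(x) → Furry(x))"],
    "conclusion": "Tom is furry.",
    "conclusion_fol": "Furry(Tom)",
}
print(len(augment(example)))  # one instance per premise prefix, plus one with the conclusion
```

Each new instance is a self-contained (predicates, premises, FOLs) triple, so a single annotated record yields several fine-tuning examples.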