Recently, ChatGPT has driven a surge of interest in NLP. ChatGPT, a large-scale transformer-based generative language model, can perform a wide variety of tasks specified in natural language. Nevertheless, large language models often perform poorly on mathematics problems that require reasoning. Prior research has shown that chain-of-thought prompting improves reasoning ability. Here, we investigate whether fine-tuning a model to generate code in Prolog, a logic programming language, and then passing that code to a compiler can further improve accuracy. To this end, we fine-tune LLaMA7B on chain-of-thought solutions as a baseline model, and fine-tune further LLaMA7B models to generate Prolog code, Prolog code followed by chain-of-thought, and chain-of-thought followed by Prolog code, respectively. The results show that the Prolog generation model outperforms the baseline, whereas the combined generation models yield no significant improvement. We release the Prolog corpus built on GSM8K and the corresponding fine-tuned LLaMA7B-based Prolog generation model to the research community.
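The four fine-tuned models differ only in what each one is trained to output. The Python sketch below shows how a single training example might be rendered for each set-up. It is illustrative only. The record fields, the prompt template, the newline separator, the `solve/1` predicate and the sample Prolog program are all hypothetical, because the abstract does not specify the corpus format.

```python
from dataclasses import dataclass


# Hypothetical record layout: the abstract does not say how the
# GSM8K-based Prolog corpus is stored, so these field names are assumptions.
@dataclass
class Example:
    question: str
    chain_of_thought: str  # natural-language step-by-step solution
    prolog: str            # Prolog program whose execution yields the answer


VARIANTS = ("cot", "prolog", "prolog+cot", "cot+prolog")


def build_target(ex: Example, variant: str) -> str:
    """Return the fine-tuning target text for one of the four set-ups."""
    if variant == "cot":          # baseline: chain-of-thought only
        return ex.chain_of_thought
    if variant == "prolog":       # Prolog code only
        return ex.prolog
    if variant == "prolog+cot":   # Prolog code, then chain-of-thought
        return ex.prolog + "\n" + ex.chain_of_thought
    if variant == "cot+prolog":   # chain-of-thought, then Prolog code
        return ex.chain_of_thought + "\n" + ex.prolog
    raise ValueError(f"unknown variant: {variant}")


def build_pair(ex: Example, variant: str) -> dict:
    # Hypothetical prompt template; the paper's actual template is not given.
    prompt = f"Question: {ex.question}\nAnswer:\n"
    return {"prompt": prompt, "target": build_target(ex, variant)}


# A GSM8K-style word problem with a hand-written (hypothetical) Prolog solution.
ex = Example(
    question=(
        "Natalia sold clips to 48 of her friends in April, and then she sold "
        "half as many clips in May. How many clips did Natalia sell "
        "altogether in April and May?"
    ),
    chain_of_thought=(
        "In May she sold 48 / 2 = 24 clips. "
        "In total she sold 48 + 24 = 72 clips. The answer is 72."
    ),
    prolog=(
        "solve(Total) :-\n"
        "    April = 48,\n"
        "    May is April // 2,\n"
        "    Total is April + May."
    ),
)

if __name__ == "__main__":
    for v in VARIANTS:
        pair = build_pair(ex, v)
        print(f"--- {v} ---\n{pair['prompt']}{pair['target']}\n")
```

In the two combined variants the same content appears in a different order. For the Prolog-producing models, the generated program would then be run by a Prolog system, and the value bound to `Total` would be taken as the answer. That execution step is not shown here.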