In this paper we examine the limitations of Large Language Models (LLMs) on complex reasoning tasks. Although recent work has started to employ formal languages as an intermediate representation for reasoning tasks, these approaches often struggle to accurately generate and refine the formal specifications needed to ensure correctness. To address these issues, this paper proposes Logic-LM++, an improvement on Logic-LM. It leverages the ability of LLMs to perform pairwise comparisons, enabling the evaluation of the refinements the LLM suggests. The paper demonstrates that Logic-LM++ outperforms Logic-LM and other contemporary techniques on natural language reasoning tasks across three datasets, FOLIO, ProofWriter, and AR-LSAT, with average improvements of 18.5% over standard prompting, 12.3% over chain-of-thought prompting, and 5% over Logic-LM.
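The refinement-selection idea can be illustrated with a minimal sketch. Note that this is an illustrative reconstruction, not the paper's actual implementation: `llm_prefers` is a hypothetical stand-in for an LLM pairwise judge, and `select_refinement` simply keeps whichever formal specification the judge prefers at each step.

```python
def llm_prefers(spec_a: str, spec_b: str) -> str:
    """Hypothetical stand-in for an LLM pairwise judge.

    A real system would prompt the LLM to compare the two candidate
    formal specifications against the problem statement; here we use a
    trivial proxy (prefer the longer, i.e. more elaborated, spec) so the
    sketch is runnable.
    """
    return spec_a if len(spec_a) >= len(spec_b) else spec_b


def select_refinement(current_spec: str, candidates: list[str]) -> str:
    """Accept a candidate refinement only if the judge prefers it over
    the current specification; otherwise keep what we have."""
    best = current_spec
    for candidate in candidates:
        best = llm_prefers(best, candidate)
    return best


# Toy usage: the judge keeps the more elaborated first-order formula.
spec = "forall x. P(x)"
refinements = ["forall x. P(x) -> Q(x)", "P"]
print(select_refinement(spec, refinements))
```

The key design point this sketch captures is that refinements are not accepted blindly: each suggested rewrite must win a comparison against the current specification before it replaces it, guarding against regressions during iterative repair.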