Prior research has enhanced the ability of Large Language Models (LLMs) to solve logic puzzles using techniques such as chain-of-thought prompting or introducing a symbolic representation. However, these frameworks are usually still insufficient for complicated logical problems, such as Zebra puzzles, because of the inherent difficulty of translating natural language clues into logical statements. We introduce ZPS, a multi-agent system that integrates LLMs with an off-the-shelf theorem prover. The system tackles the puzzle-solving task by breaking the problem into smaller, manageable parts, generating SMT (Satisfiability Modulo Theories) code to solve them with the theorem prover, and using feedback between the agents to iteratively refine their answers. We also introduce an automated grid puzzle grader to assess the correctness of our puzzle solutions, and we demonstrate its reliability in a user study. Our approach improves results for all three LLMs we tested, with GPT-4 showing a 166% increase in the number of fully correct solutions.