Large Language Models (LLMs) have shown remarkable proficiency in language understanding and have been successfully applied to a variety of real-world tasks through task-specific fine-tuning or prompt engineering. Despite these advancements, it remains an open question whether LLMs are fundamentally capable of reasoning and planning, or whether they primarily rely on recalling and synthesizing information from their training data. In our research, we introduce a novel task -- Minesweeper -- specifically designed in a format unfamiliar to LLMs and absent from their training datasets. This task challenges LLMs to identify the locations of mines based on numerical clues provided by adjacent opened cells. Successfully completing this task requires understanding each cell's state, discerning the spatial relationships between clues and mines, and strategizing actions based on logical deductions drawn from the arrangement of the cells. Our experiments, including trials with the advanced GPT-4 model, indicate that while LLMs possess the foundational abilities required for this task, they struggle to integrate these abilities into the coherent, multi-step logical reasoning process needed to solve Minesweeper. These findings highlight the need for further research to understand the nature of reasoning capabilities in LLMs under similar circumstances, and to explore pathways towards more sophisticated AI reasoning and planning models.
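
To make the kind of deduction described above concrete, the following is a minimal sketch of how numerical clues constrain mine locations. It is an illustration only, not the board format or the method used in our experiments: the string encoding (`?` for an unopened cell, `F` for a flagged cell, digits for opened cells) and the function names `neighbors` and `deduce` are assumptions introduced here. The sketch applies only the two simplest single-clue rules and repeats them until nothing new can be inferred.

```python
from typing import Iterator, List, Set, Tuple

Cell = Tuple[int, int]


def neighbors(board: List[str], r: int, c: int) -> Iterator[Cell]:
    """Yield the in-bounds cells among the up to 8 cells adjacent to (r, c)."""
    rows, cols = len(board), len(board[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols:
                yield (r + dr, c + dc)


def deduce(board: List[str]) -> Tuple[Set[Cell], Set[Cell]]:
    """Return (mines, safe) cells implied by single-clue rules.

    Hypothetical encoding: '?' unopened, 'F' flagged, '0'-'8' opened clue.
    """
    mines: Set[Cell] = set()
    safe: Set[Cell] = set()
    changed = True
    while changed:  # repeat so that one deduction can enable the next
        changed = False
        for r, row in enumerate(board):
            for c, ch in enumerate(row):
                if not ch.isdigit():
                    continue
                nbrs = list(neighbors(board, r, c))
                unknown = [
                    n for n in nbrs
                    if board[n[0]][n[1]] == "?" and n not in mines and n not in safe
                ]
                if not unknown:
                    continue
                known = sum(
                    1 for n in nbrs if n in mines or board[n[0]][n[1]] == "F"
                )
                remaining = int(ch) - known
                if remaining == len(unknown):
                    # Every undecided neighbor must be a mine.
                    mines.update(unknown)
                    changed = True
                elif remaining == 0:
                    # The clue is already satisfied, so the rest are safe.
                    safe.update(unknown)
                    changed = True
    return mines, safe


if __name__ == "__main__":
    mines, safe = deduce(["1?1?"])
    print("mines:", sorted(mines), "safe:", sorted(safe))
```

On the one-row board `1?1?`, the first clue forces its only unopened neighbor to be a mine, and that in turn satisfies the second clue, so the last cell is safe. Chaining even two such steps already requires tracking cell states, spatial adjacency, and intermediate conclusions together, which is the integration that the task probes.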