Artificial intelligence (AI) has made remarkable progress across many domains, and large language models such as ChatGPT have attracted substantial attention for their human-like text generation. Spatial reasoning nevertheless remains a significant challenge for these models. On benchmarks such as StepGame, which evaluate AI spatial reasoning, ChatGPT has performed unsatisfactorily. However, the benchmark contains template errors that distort the evaluation results, so ChatGPT may perform better once these errors are corrected, and its spatial reasoning capabilities can then be assessed more accurately. In this study, we refine the StepGame benchmark to provide a more accurate dataset for model evaluation. We analyze GPT's spatial reasoning performance on the rectified benchmark and find that it is proficient at mapping natural language text to spatial relations but limited in multi-hop reasoning. By combining template-to-relation mapping with logic-based reasoning, we provide a solution that performs qualitative reasoning on StepGame without errors. We then address the limitations of GPT models in spatial reasoning by deploying Chain-of-thought and Tree-of-thoughts prompting strategies, which offer insight into GPT's ``cognitive process'' and yield substantial improvements in accuracy. Our investigation not only sheds light on model deficiencies but also proposes enhancements, contributing to the advancement of AI with more robust spatial reasoning capabilities.
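To make the idea of combining template-to-relation mapping with logic-based reasoning concrete, the following is a minimal sketch and not the implementation used in this study. It assumes a handful of hypothetical sentence templates, represents each relation as a unit offset on a grid, and answers a multi-hop query by summing offsets along a chain of objects; the names `TEMPLATES`, `parse`, and `relation` are illustrative only, and the logic formalism actually used may differ.

```python
import re
from collections import deque

# Hypothetical templates (assumption): each maps "A ... B" to the unit
# offset (dx, dy) of A relative to B.
TEMPLATES = [
    (re.compile(r"^(\w) is to the left of (\w)\.$"), (-1, 0)),
    (re.compile(r"^(\w) is to the right of (\w)\.$"), (1, 0)),
    (re.compile(r"^(\w) is above (\w)\.$"), (0, 1)),
    (re.compile(r"^(\w) is below (\w)\.$"), (0, -1)),
]

# Sign pattern of the accumulated offset -> qualitative relation name.
NAMES = {
    (0, 0): "overlap",
    (-1, 0): "left", (1, 0): "right",
    (0, 1): "above", (0, -1): "below",
    (-1, 1): "upper-left", (1, 1): "upper-right",
    (-1, -1): "lower-left", (1, -1): "lower-right",
}


def parse(sentence):
    """Map one sentence to (a, b, offset of a relative to b)."""
    for pattern, offset in TEMPLATES:
        match = pattern.match(sentence)
        if match:
            return match.group(1), match.group(2), offset
    raise ValueError(f"no template matches: {sentence!r}")


def relation(story, source, target):
    """Return the qualitative relation of `source` with respect to `target`."""
    # edges[a][b] holds the offset of a relative to b (and the inverse for b, a).
    edges = {}
    for sentence in story:
        a, b, (dx, dy) = parse(sentence)
        edges.setdefault(a, {})[b] = (dx, dy)
        edges.setdefault(b, {})[a] = (-dx, -dy)

    # Breadth-first search; offsets[n] is the offset of `source` relative to n.
    offsets = {source: (0, 0)}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt, (dx, dy) in edges.get(node, {}).items():
            if nxt not in offsets:
                px, py = offsets[node]
                offsets[nxt] = (px + dx, py + dy)
                queue.append(nxt)

    if target not in offsets:
        raise ValueError(f"{source!r} and {target!r} are not connected")
    x, y = offsets[target]
    sign = lambda v: (v > 0) - (v < 0)
    return NAMES[(sign(x), sign(y))]


story = ["A is to the left of B.", "B is above C."]
print(relation(story, "A", "C"))  # upper-left
```

In this sketch the language-dependent step (matching a sentence to a template) is kept separate from the multi-hop step (composing relations), which mirrors the division of labor described above: the mapping can be handled by pattern matching or a language model, while the composition is done deterministically.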