Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they often struggle with spatial reasoning. This paper presents a novel neural-symbolic framework that enhances LLMs' spatial reasoning abilities through iterative feedback between LLMs and Answer Set Programming (ASP). We evaluate our approach on two benchmark datasets, StepGame and SparQA, implementing three distinct strategies: (1) a direct prompting baseline, (2) Facts+Rules prompting, and (3) a DSPy-based LLM+ASP pipeline with iterative refinement. Our experimental results demonstrate that the LLM+ASP pipeline significantly outperforms baseline methods, achieving average accuracies of 82% on StepGame and 69% on SparQA, improvements of 40-50% and 8-15% respectively over direct prompting. The success stems from three key innovations: (1) effective separation of semantic parsing and logical reasoning through a modular pipeline, (2) an iterative feedback mechanism between LLMs and ASP solvers that improves the program generation rate, and (3) robust error handling that addresses parsing, grounding, and solving failures. Additionally, we propose Facts+Rules as a lightweight alternative that achieves comparable performance on the more complex SparQA dataset while reducing computational overhead. Our analysis across different LLM architectures (Deepseek, Llama3-70B, GPT-4o mini) demonstrates the framework's generalizability and provides insights into the trade-offs between implementation complexity and reasoning capability, contributing to the development of more interpretable and reliable AI systems.
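The iterative LLM+ASP feedback loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the semantic parser and the ASP solver are stubbed with hypothetical functions (`llm_parse`, `asp_solve`); in the actual pipeline the parser would be a DSPy-driven LLM call and the solver an ASP system such as clingo.

```python
# Minimal sketch of the iterative LLM+ASP feedback loop (stubbed components).

def llm_parse(story, error=None):
    """Stub semantic parser: turns a natural-language story into ASP facts.
    A real implementation would prompt an LLM, including the previous
    solver error in the prompt so the model can repair its output."""
    if error is None:
        # First attempt: a deliberately malformed fact (wrong arity).
        return ["left_of(a)."]
    # Repaired attempt after receiving solver feedback.
    return ["left_of(a, b)."]

def asp_solve(facts):
    """Stub ASP solver: returns (answer, error). Flags facts whose
    arity does not match the expected binary spatial relation."""
    if any("(" in f and "," not in f for f in facts):
        return None, "grounding error: left_of/1 does not match left_of/2"
    return "left", None  # e.g. the derived relation of a to b

def pipeline(story, max_iters=3):
    """Iterate parse -> solve, feeding solver errors back to the parser."""
    error = None
    for _ in range(max_iters):
        facts = llm_parse(story, error)
        answer, error = asp_solve(facts)
        if error is None:
            return answer
    return None  # unresolved after max_iters attempts

print(pipeline("a is to the left of b. Where is a relative to b?"))  # left
```

The separation mirrors the abstract's first innovation: `llm_parse` handles only semantic parsing, while all spatial inference lives in the (here stubbed) ASP program, and the error channel implements the iterative refinement.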