Sound and Complete Neurosymbolic Reasoning with LLM-Grounded Interpretations

from arxiv, 43 pages, 14 tables, 4 figures. Accepted to the 19th Conference on Neurosymbolic Learning and Reasoning (NeSy 2025); to appear Neurosymbolic Artifical Intelligence Special Issue on NeSy 2025 Extended Papers

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but exhibit problems with logical consistency in their output. How can we harness LLMs' broad-coverage parametric knowledge in formal reasoning despite their inconsistency? We present a method for directly integrating an LLM into the interpretation function of the formal semantics for a paraconsistent logic. We evaluate the method empirically using datasets derived from the short-form factuality benchmarks GPQA and SimpleQA, showing that bilateral factuality evaluation improves macro-F1 over a unilateral baseline by roughly 6 percentage points on both benchmarks (at the cost of reduced coverage, as abstention is triggered on inconsistent or uncertain cases). We further describe a proof-of-concept tableau reasoner implementing the method, and apply it to a medication-safety knowledge base of 228 asserted and 712 inferred statements: the system detects 92 gluts corresponding to medically significant errors (e.g., opioids inferred as non-addictive, beta-blockers inferred as safe in asthma) while remaining satisfiable, demonstrating that contradictions are localized rather than causing logical explosion. Unlike prior work, our method offers a theoretical framework with a practical implementation for neurosymbolic reasoning that leverages an LLM's knowledge while preserving the underlying logic's soundness and completeness properties.

翻译：大语言模型（LLM）在自然语言理解与生成方面展现出卓越能力，但其输出存在逻辑一致性问题。如何在利用LLM广阔覆盖范围的参数化知识实现形式化推理的同时，规避其不一致性？本文提出一种方法，将大语言模型直接集成到次协调逻辑形式语义的解释函数中。我们通过从短文本事实性基准测试GPQA和SimpleQA导出的数据集进行实证评估，结果表明双边事实性评估在两个基准上的宏F1分数较单边基线提升约6个百分点（代价是覆盖率降低——因不一致或不确定案例触发弃权）。我们进一步实现了该方法的概念验证表格推理器，并将其应用于包含228条断言和712条推理语句的用药安全知识库：系统检测到92处对应重大医学错误（例如阿片类药物被推理为无成瘾性、哮喘患者中使用β受体阻滞剂被推理为安全）的"爆炸点"，同时保持可满足性，证明矛盾被局部化而非引发逻辑爆炸。与现有工作不同，我们的方法提供了兼具理论框架与实践实现的神经符号推理方案，在利用大语言模型知识的同时，保持了底层逻辑的可靠性与完备性特性。