Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but exhibit problems with logical consistency in their output. How can we harness LLMs' broad-coverage parametric knowledge in formal reasoning despite their inconsistency? We present a method for directly integrating an LLM into the interpretation function of the formal semantics for a paraconsistent logic. We evaluate the method empirically using datasets derived from the short-form factuality benchmarks GPQA and SimpleQA, showing that bilateral factuality evaluation improves macro-F1 over a unilateral baseline by roughly 6 percentage points on both benchmarks (at the cost of reduced coverage, as abstention is triggered on inconsistent or uncertain cases). We further describe a proof-of-concept tableau reasoner implementing the method, and apply it to a medication-safety knowledge base of 228 asserted and 712 inferred statements: the system detects 92 gluts corresponding to medically significant errors (e.g., opioids inferred as non-addictive, beta-blockers inferred as safe in asthma) while remaining satisfiable, demonstrating that contradictions are localized rather than causing logical explosion. Unlike prior work, our method offers a theoretical framework with a practical implementation for neurosymbolic reasoning that leverages an LLM's knowledge while preserving the underlying logic's soundness and completeness properties.
翻译:大语言模型(LLM)在自然语言理解与生成方面展现出卓越能力,但其输出存在逻辑一致性问题。如何在利用LLM广阔覆盖范围的参数化知识实现形式化推理的同时,规避其不一致性?本文提出一种方法,将大语言模型直接集成到次协调逻辑形式语义的解释函数中。我们通过从短文本事实性基准测试GPQA和SimpleQA导出的数据集进行实证评估,结果表明双边事实性评估在两个基准上的宏F1分数较单边基线提升约6个百分点(代价是覆盖率降低——因不一致或不确定案例触发弃权)。我们进一步实现了该方法的概念验证表格推理器,并将其应用于包含228条断言和712条推理语句的用药安全知识库:系统检测到92处对应重大医学错误(例如阿片类药物被推理为无成瘾性、哮喘患者中使用β受体阻滞剂被推理为安全)的"爆炸点",同时保持可满足性,证明矛盾被局部化而非引发逻辑爆炸。与现有工作不同,我们的方法提供了兼具理论框架与实践实现的神经符号推理方案,在利用大语言模型知识的同时,保持了底层逻辑的可靠性与完备性特性。