Large Language Models (LLMs) often exhibit factual inconsistencies and logical decay in extended, multi-turn dialogues, a challenge stemming from their reliance on static, pre-trained knowledge and their inability to reason adaptively over the dialogue history. Prevailing mitigation strategies, such as Retrieval-Augmented Generation (RAG) and agentic working memories, improve information recall but still engage with fundamentally static knowledge sources and follow pre-defined, single reasoning paths. This hinders their ability to preserve the factual and logical consistency of their responses as the dialogue context evolves. To address this issue, we propose D-SMART, a model-agnostic framework designed to maintain multi-turn dialogue consistency by enabling LLMs to build and reason over a dynamic, structured representation of the conversational context. This is achieved via two synergistic components: (1) a Dynamic Structured Memory (DSM), which incrementally constructs and maintains an authoritative, OWL-compliant knowledge graph of the conversation; and (2) a Reasoning Tree (RT), which executes inferences as an explicit and traceable multi-step search over the graph. Because the widely used quality score (judged by GPT-4) can overlook logical flaws, we introduce new NLI-based metrics to better measure multi-turn dialogue consistency. Comprehensive experiments on the MT-Bench-101 benchmark show that D-SMART significantly outperforms state-of-the-art baselines, elevating the dialogue consistency score by over 48\% for both proprietary and open-source models, and notably improving the quality score of the latter by up to 10.1\%.