Large Language Models (LLMs) have achieved unprecedented fluency but remain susceptible to "hallucinations": the generation of factually incorrect or ungrounded content. This limitation is particularly critical in high-stakes domains where reliability is paramount. We propose a domain-grounded, tiered retrieval and verification architecture designed to systematically intercept factual inaccuracies by shifting LLMs from stochastic pattern matching toward evidence-verified answering. The framework is a four-phase, self-regulating pipeline implemented in LangGraph: (I) Intrinsic Verification with early-exit logic to save compute, (II) Adaptive Search Routing, which uses a Domain Detector to target subject-specific archives, (III) Corrective Document Grading (CRAG) to filter irrelevant context, and (IV) Extrinsic Regeneration followed by atomic claim-level verification. We evaluated the system on 650 queries drawn from five diverse benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA. The pipeline consistently outperforms zero-shot baselines on all five benchmarks. Win rates were highest on TimeQA v2 (83.7%) and MMLU Global Facts (78.0%), indicating strong performance on questions that require fine-grained temporal and numerical precision. Groundedness scores remained between 78.8% and 86.4% across factual-answer rows. Although the architecture provides a robust safeguard against misinformation, we identified a persistent failure mode, "False-Premise Overclaiming". These findings provide a detailed empirical characterization of multi-stage RAG behavior and suggest that future work should prioritize pre-retrieval "answerability" nodes to further narrow the reliability gap in conversational AI.
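The control flow of the four phases named in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it does not use LangGraph, and every name in it (`PipelineState`, `run_pipeline`, the `toy_*` helpers, the `ARCHIVES` dictionary) is hypothetical. The LLM calls, the Domain Detector, the document grader, and the claim verifier are all replaced by simple string heuristics so that the routing logic is visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PipelineState:
    """Hypothetical shared state passed through the four phases."""
    query: str
    draft: str = ""
    domain: str = ""
    documents: List[str] = field(default_factory=list)
    answer: str = ""
    trace: List[str] = field(default_factory=list)


def run_pipeline(
    query: str,
    draft_fn: Callable[[str], str],
    confident_fn: Callable[[str, str], bool],
    detect_domain: Callable[[str], str],
    search: Callable[[str, str], List[str]],
    grade: Callable[[str, str], bool],
    regenerate: Callable[[str, List[str]], str],
    verify_claims: Callable[[str, List[str]], str],
) -> PipelineState:
    state = PipelineState(query=query)

    # Phase I: intrinsic verification with early exit.
    state.draft = draft_fn(query)
    state.trace.append("intrinsic_verification")
    if confident_fn(query, state.draft):
        state.answer = state.draft
        state.trace.append("early_exit")
        return state

    # Phase II: adaptive search routing via a domain detector.
    state.domain = detect_domain(query)
    candidates = search(query, state.domain)
    state.trace.append("adaptive_search_routing")

    # Phase III: document grading filters out irrelevant context.
    state.documents = [doc for doc in candidates if grade(query, doc)]
    state.trace.append("document_grading")

    # Phase IV: regenerate from evidence, then verify claim by claim.
    regenerated = regenerate(query, state.documents)
    state.trace.append("extrinsic_regeneration")
    state.answer = verify_claims(regenerated, state.documents)
    state.trace.append("claim_verification")
    return state


# --- Toy stand-ins (assumptions for illustration only) ---
ARCHIVES = {
    "history": ["The Berlin Wall fell in 1989.", "Bananas are yellow."],
    "general": ["Water boils at 100 C at sea level."],
}


def toy_draft(query: str) -> str:
    return "The Berlin Wall fell in 1991."  # deliberately wrong draft


def toy_detect_domain(query: str) -> str:
    return "history" if "when" in query.lower() else "general"


def toy_search(query: str, domain: str) -> List[str]:
    return ARCHIVES.get(domain, [])


def toy_grade(query: str, doc: str) -> bool:
    # Keep a document if it shares any query word longer than 3 characters.
    words = [w.strip("?.,!").lower() for w in query.split()]
    return any(len(w) > 3 and w in doc.lower() for w in words)


def toy_regenerate(query: str, docs: List[str]) -> str:
    # Appends one unsupported sentence so the verifier has work to do.
    return (docs[0] if docs else "") + " It was painted green."


def toy_verify_claims(text: str, docs: List[str]) -> str:
    # Naive claim split on "."; keeps claims found verbatim in a document.
    claims = [c.strip() for c in text.split(".") if c.strip()]
    kept = [c for c in claims if any(c.lower() in d.lower() for d in docs)]
    return ". ".join(kept) + "." if kept else "Unable to verify an answer."


result = run_pipeline(
    "When did the Berlin Wall fall?",
    toy_draft, lambda q, d: False, toy_detect_domain,
    toy_search, toy_grade, toy_regenerate, toy_verify_claims,
)
print(result.answer)
print(result.trace)
```

In this sketch the early exit in Phase I is the only branch; a graph-based implementation would presumably express the same decision as a conditional edge, and could add further branches (for example, a retry when grading leaves no documents) that are not modeled here.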