Hallucination remains a major reliability barrier for production LLM systems, particularly in multi-agent pipelines where unsupported claims can propagate unchecked across stages. This paper adapts a HOPE-inspired Nested Learning architecture with Continuum Memory Systems (CMS) and semantic similarity caching, and applies it to a hybrid benchmark of 310 prompts combining 217 epistemic-uncertainty prompts and 93 fabrication-induction stress-test prompts. A three-stage agentic pipeline orchestrated via the Open Floor Protocol (OFP) is evaluated with five KPIs: FCD (Factual Claim Density), FGR (Factual Grounding References), FDF (Fictional Disclaimer Frequency), ECS (Explicit Contextualization Score), and OSR (Observability Score Ratio). These are aggregated into THS (Total Hallucination Score) under five weighting configurations to study mitigation-observability trade-offs. FDF, ECS, OSR, and FGR are subtracted as mitigation signals, so a more negative THS indicates stronger mitigation. The FrontEndAgent is configured as a high-stochasticity generator (temperature = 1.0) to produce a realistic hallucination baseline, while the SecondLevelReviewer and ThirdLevelReviewer operate as progressive correctors. This asymmetric design lowers end-to-end THS by 31.3% to 35.9% across the five weighting configurations. Semantic caching achieves 440 cache hits over 930 potential calls (47.3% hit rate), reducing LLM invocations to 490, which lowers the energy and CO2e footprint and makes multi-stage review pipelines operationally viable at production scale. The ExtremeObservability configuration attains the most negative final THS (-0.0709), indicating that observability-heavy configurations reinforce rather than compromise mitigation. These findings suggest that memory-augmented multi-agent designs can jointly improve factual reliability, operational efficiency, and auditability without model retraining.
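To make the sign convention concrete, the aggregation can be sketched as a weighted sum in which FCD contributes positively and the four mitigation signals are subtracted. This is a minimal sketch only: the abstract does not give the exact formula, KPI normalization, or weight values, so the linear form, the `KPIs` and `Weights` containers, and the `relative_change_pct` helper are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KPIs:
    """Per-response KPI values (assumed to be normalized to a common scale)."""
    fcd: float  # Factual Claim Density
    fgr: float  # Factual Grounding References
    fdf: float  # Fictional Disclaimer Frequency
    ecs: float  # Explicit Contextualization Score
    osr: float  # Observability Score Ratio


@dataclass(frozen=True)
class Weights:
    """One weighting configuration (hypothetical values, not the paper's)."""
    fcd: float
    fgr: float
    fdf: float
    ecs: float
    osr: float


def total_hallucination_score(k: KPIs, w: Weights) -> float:
    """Assumed linear THS: FCD adds, the mitigation signals subtract.

    A more negative result means stronger mitigation.
    """
    mitigation = w.fgr * k.fgr + w.fdf * k.fdf + w.ecs * k.ecs + w.osr * k.osr
    return w.fcd * k.fcd - mitigation


def relative_change_pct(first_stage: float, final_stage: float) -> float:
    """Signed percentage change in THS between two stages.

    The absolute value in the denominator (an assumption) keeps a decrease
    negative even when the first-stage THS is itself negative.
    """
    return 100.0 * (final_stage - first_stage) / abs(first_stage)
```

Under this form, raising the weights on FDF, ECS, OSR, or FGR can only push THS lower for a fixed response, which is consistent with an observability-heavy configuration reaching the most negative final score.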
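The caching idea can be illustrated with a small similarity-threshold cache that reuses a stored response whenever a new prompt's embedding is close enough to a previous one. This is a sketch under stated assumptions, not the paper's implementation: the `SemanticCache` class, the cosine-similarity criterion, the 0.9 default threshold, and the caller-supplied `embed` and `llm_call` functions are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Reuse a cached response when a prompt is similar enough to an earlier one."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed          # hypothetical embedding function
        self.threshold = threshold  # assumed similarity cut-off
        self.entries: List[Tuple[Vector, str]] = []
        self.lookups = 0
        self.hits = 0

    def get_or_call(self, prompt: str, llm_call: Callable[[str], str]) -> str:
        """Return a cached response on a hit, otherwise invoke the model."""
        self.lookups += 1
        vec = self.embed(prompt)
        # Linear scan for the most similar stored prompt (fine for a sketch).
        best_score, best_response = -1.0, None
        for stored_vec, stored_response in self.entries:
            score = cosine_similarity(vec, stored_vec)
            if score > best_score:
                best_score, best_response = score, stored_response
        if best_response is not None and best_score >= self.threshold:
            self.hits += 1
            return best_response
        response = llm_call(prompt)
        self.entries.append((vec, response))
        return response

    @property
    def hit_rate(self) -> float:
        """Fraction of lookups served from the cache."""
        return self.hits / self.lookups if self.lookups else 0.0
```

With the reported figures, 440 hits over 930 lookups gives a hit rate of 440 / 930 ≈ 47.3%, leaving 930 - 440 = 490 actual LLM invocations.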