Hallucination Cascade: Analyzing Error Propagation in Multi-Agent LLM Systems

Large Language Models (LLMs) generate fluent text but remain vulnerable to hallucinations, producing unsupported, inconsistent, and factually incorrect claims. Most prior work treats hallucination as a static property of isolated outputs. In multi-agent LLM systems, however, responses are exchanged across agents, revised through sequential stages, and reused as context for later reasoning. Hallucination, therefore, becomes a dynamic process shaped by interaction history, cascade depth, and model heterogeneity. This paper analyzes hallucination dynamics in multi-agent LLM cascades by tracking claim-level factual inconsistencies across sequential agent interactions. We conduct 500 cascade experiments across 10 knowledge domains using GPT-5.3, DeepSeek-V3, and LLaMA-3-70B-Instruct, yielding 1,250 evaluated responses. Results show that deeper cascades reduce the normalized hallucination score from 0.422 at the first agent to 0.272 at the final agent in 3-agent chains, with an amplification factor of 0.644, indicating net attenuation. This reduction is accompanied by a decline in factual accuracy from 0.789 to 0.769, revealing a trade-off between hallucination suppression and factual preservation. Transition-level analysis shows that each agent-to-agent refinement reduces hallucination by an average of 0.072, with small but consistent losses in factual consistency and response quality. Model-level results reveal reliability-efficiency trade-offs: LLaMA-3-70B-Instruct achieves the lowest hallucination score, whereas GPT-5.3 provides faster generation with a higher hallucination rate. Domain-level analysis shows that hallucination varies with topic complexity, with lower scores in well-grounded scientific domains and higher scores in more abstract domains.
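The reported figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes the amplification factor is the ratio of the final agent's hallucination score to the first agent's score; the abstract does not state the definition, so this is an illustrative assumption, and the function names are hypothetical.

```python
def amplification_factor(first_score: float, final_score: float) -> float:
    """Assumed definition: final-agent score divided by first-agent score.

    Values below 1 indicate attenuation; values above 1 indicate amplification.
    """
    if first_score <= 0:
        raise ValueError("first_score must be positive")
    return final_score / first_score


def endpoint_delta_per_transition(first_score: float, final_score: float,
                                  n_agents: int) -> float:
    """Average score change per agent-to-agent hop, from endpoint scores only."""
    if n_agents < 2:
        raise ValueError("a cascade needs at least two agents")
    return (final_score - first_score) / (n_agents - 1)


# Rounded endpoint scores for 3-agent chains, as reported in the abstract.
first, final = 0.422, 0.272

af = amplification_factor(first, final)
delta = endpoint_delta_per_transition(first, final, n_agents=3)

print(f"amplification factor: {af:.4f}")
print(f"mean change per transition (endpoints only): {delta:.4f}")
```

Under this assumed definition the rounded endpoints give a ratio of roughly 0.6445, in line with the reported 0.644 and below 1, which matches the net-attenuation reading. The endpoint-only change per transition is about -0.075, slightly larger in magnitude than the reported average reduction of 0.072; that average is presumably taken over all transitions in the experiments rather than over the 3-agent endpoints alone.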
