Trust but Verify: Mitigating Medical Hallucinations via Post-Hoc Adversarial Auditing and Multi-Agent Feedback Loops

Large Language Models (LLMs) are increasingly deployed in healthcare settings, yet their tendency to hallucinate poses risks when clinical decisions are involved. This study examine whether LLMs recommend recently banned or withdrawn pharmaceuticals when answering clinical questions and tests an agent-based method for reducing such errors. We developed a five-agent "Trust but Verify" system using a single LLM backbone. To measure regulatory knowledge obsolescence, we created an adversarial dataset of 103 clinical MCQs where historically correct answers now refer to banned substances. This scale ensures statistical significance across various therapeutic classes. We evaluated three open-access model families (GPT-OSS, Llama-3, Falcon-3) under vanilla and agentic conditions. Performance was measured via pointwise score, label accuracy, Hallucination Error Rate (HER), and Component Fidelity (CF) score. We also observed clinical safety regression in proprietary models. In default configurations, all models showed high hallucination rates, consistently selecting banned drugs that matched training data patterns. Our proposed agentic architecture reduced HER by approximately 53% across models. Pointwise scores shifted from -0.25 (unsafe recommendation) toward 0.0 (appropriate refusal). The safety audit intercepted dangerous outputs even when models' parametric knowledge favored the banned substance. The proposed multi-agent framework offers a model-agnostic method for enforcing regulatory compliance that prioritizes patient safety over fluent text generation. Our work demonstrates a practical approach for deploying autonomous AI systems in safety-critical healthcare settings. It shows how real-time regulatory data can be integrated into LLM pipelines to support clinical decision-making.

翻译：大型语言模型（LLMs）在医疗领域的应用日益广泛，但其幻觉倾向在临床决策场景中构成风险。本研究考察了LLMs在回答临床问题时是否会推荐近期被禁用或撤市的药物，并测试了一种基于智能体的错误减少方法。我们基于单一LLM主干构建了包含五个智能体的"信任但核查"系统。为衡量监管知识过时程度，我们创建了包含103道临床多选题的对抗性数据集，其中原本正确的历史答案现指向违禁物质。该规模确保了不同治疗类别统计显著性。我们评估了三个开源模型系列（GPT-OSS、Llama-3、Falcon-3）在原始与智能体条件下的表现，通过逐点评分、标签准确率、幻觉错误率（HER）和组件保真度（CF）分数衡量性能。我们还观察到专有模型的临床安全退化现象。在默认配置中，所有模型均表现出高幻觉率，持续选择与训练数据模式匹配的禁用药物。我们提出的智能体架构使各模型的HER降低了约53%。逐点评分从-0.25（不安全推荐）向0.0（恰当拒绝）偏移。即使模型的参数知识倾向于违禁物质，安全审计也能拦截危险输出。该多智能体框架提供了一种与模型无关的监管合规方法，将患者安全置于流畅文本生成之上。我们的工作展示了在安全关键的医疗场景中部署自主AI系统的实用方案，证明了如何将实时监管数据集成到LLM流程中以支持临床决策。