A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing

Large language models (LLMs) show promise for healthcare question answering, but clinical use is limited by weak verification, insufficient evidence grounding, and unreliable confidence signalling. We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability. Our approach has two phases. First, we fine-tune three representative LLM families (GPT, LLaMA, and DeepSeek R1) on MedQuAD-derived medical QA data (20k+ question-answer pairs across multiple NIH domains) and benchmark generation quality. DeepSeek R1 achieves the strongest scores (ROUGE-1 0.536 ± 0.04; ROUGE-2 0.226 ± 0.03; BLEU 0.098 ± 0.018) and substantially outperforms the specialised biomedical baseline BioGPT in zero-shot evaluation. Second, we implement a modular multi-agent pipeline in which a Clinical Reasoning agent (fine-tuned LLaMA) produces structured explanations, an Evidence Retrieval agent queries PubMed to ground responses in recent literature, and a Refinement agent (DeepSeek R1) improves clarity and factual consistency; an optional human validation path is triggered for high-risk or high-uncertainty cases. Safety mechanisms include Monte Carlo dropout and perplexity-based uncertainty scoring, plus lexical and sentiment-based bias detection supported by LIME/SHAP-based analyses. In evaluation, the full system achieves 87% accuracy with relevance around 0.80, and evidence augmentation reduces uncertainty (perplexity 4.13) compared to base responses, with mean end-to-end latency of 36.5 seconds under the reported configuration. Overall, the results indicate that agent specialisation and verification layers can mitigate key single-model limitations and provide a practical, extensible design for evidence-based and bias-aware medical AI.
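As an illustration of the uncertainty signals named above, the following minimal sketch computes perplexity from per-token log-probabilities, summarises the spread of a scalar score across Monte Carlo dropout passes, and decides whether to route a case to human validation. It is not the implementation used in this work: the function names, the choice of standard deviation as the spread measure, and both thresholds (`ppl_max`, `spread_max`) are hypothetical placeholders chosen for the example.

```python
import math
from statistics import mean, pstdev


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-mean(token_logprobs))


def mc_dropout_spread(pass_scores):
    """Population standard deviation of a scalar score across stochastic
    forward passes (dropout left active at inference time)."""
    return pstdev(pass_scores)


def needs_human_review(ppl, spread, high_risk, ppl_max=10.0, spread_max=0.15):
    """Escalate when the case is flagged high-risk or either uncertainty
    signal exceeds its threshold. Thresholds are hypothetical, not values
    reported in the paper."""
    return high_risk or ppl > ppl_max or spread > spread_max


if __name__ == "__main__":
    # Made-up inputs for demonstration only.
    logprobs = [-1.2, -0.8, -2.0, -1.6]
    ppl = perplexity(logprobs)
    spread = mc_dropout_spread([0.71, 0.69, 0.74, 0.70])
    print(ppl, spread, needs_human_review(ppl, spread, high_risk=False))
```

Lower perplexity means the model assigns higher probability to its own output, which is why a drop in perplexity after evidence augmentation is read as reduced uncertainty.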
