Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis is then subjected to a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100-question and 250-question high-disagreement subsets of both MedQA-USMLE and MedMCQA. Calibration improvement is the central finding, with ECE reduced by 49-74% across all four settings, including the harder MedMCQA benchmark where these gains persist even when absolute accuracy is constrained by knowledge-intensive recall demands. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Ablation analysis identifies Two-Phase Verification as the primary calibration driver and multi-agent reasoning as the primary accuracy driver. These results establish that consistency-based verification produces more reliable uncertainty estimates across diverse medical question types, providing a practical confidence signal for deferral in safety-critical clinical AI applications.
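For readers unfamiliar with the calibration metric reported above, the following is a minimal sketch of expected calibration error (ECE) under the common equal-width binning definition. The bin count of 10 and the function name `expected_calibration_error` are illustrative assumptions; the abstract does not state which binning scheme was used, so this should not be read as the evaluation code behind the reported numbers.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: the abstract does not specify a bin count
) -> float:
    """ECE with equal-width confidence bins over [0, 1].

    ECE = sum over bins of (bin size / N) * |bin accuracy - bin mean confidence|.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a bin by its confidence.
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # An overconfident model: 95% stated confidence, 50% actual accuracy.
    overconfident = expected_calibration_error(
        [0.95, 0.95, 0.95, 0.95], [True, True, False, False]
    )
    print(f"ECE = {overconfident:.3f}")
```

A lower value means the reported confidence tracks empirical accuracy more closely, which is what makes the score usable as a deferral signal.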