Terminology substitution errors in clinical notes, where one medical term is replaced by a linguistically valid but clinically different term, pose a persistent challenge for automated error detection in healthcare. We introduce BLUEmed, a multi-agent debate framework augmented with hybrid Retrieval-Augmented Generation (RAG) that combines evidence-grounded reasoning with multi-perspective verification for clinical error detection. BLUEmed decomposes each clinical note into focused sub-queries, retrieves source-partitioned evidence through dense, sparse, and online retrieval, and assigns two domain expert agents distinct knowledge bases to produce independent analyses; when the experts disagree, a structured counter-argumentation round and cross-source adjudication resolve the conflict, followed by a cascading safety layer that filters common false-positive patterns. We evaluate BLUEmed on a clinical terminology substitution detection benchmark under both zero-shot and few-shot prompting with multiple backbone models spanning proprietary and open-source families. Experimental results show that BLUEmed achieves the best accuracy (69.13%), ROC-AUC (74.45%), and PR-AUC (72.44%) under few-shot prompting, outperforming both single-agent RAG and debate-only baselines. Further analyses across six backbone models and two prompting strategies confirm that retrieval augmentation and structured debate are complementary, and that the framework benefits most from models with sufficient instruction-following and clinical language understanding.
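The control flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the BLUEmed implementation: every name here (`Verdict`, `detect_error`, and the callables passed in for decomposition, retrieval, expert analysis, debate, and safety filtering) is hypothetical. The sketch also assumes one retriever per expert and collapses counter-argumentation and cross-source adjudication into a single `debate` callable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    """Hypothetical container for one agent's decision on a clinical note."""
    has_error: bool
    rationale: str


# Hypothetical callable types; none of these come from the source.
Decomposer = Callable[[str], List[str]]             # note -> sub-queries
Retriever = Callable[[str], List[str]]              # sub-query -> evidence passages
Expert = Callable[[str, List[str]], Verdict]        # (note, evidence) -> verdict
Debate = Callable[[str, Verdict, Verdict, List[str]], Verdict]
SafetyFilter = Callable[[str, Verdict], Verdict]    # may overturn false positives


def detect_error(
    note: str,
    decompose: Decomposer,
    retrieve_a: Retriever,
    retrieve_b: Retriever,
    expert_a: Expert,
    expert_b: Expert,
    debate: Debate,
    safety_filter: SafetyFilter,
) -> Verdict:
    # 1. Break the note into focused sub-queries.
    sub_queries = decompose(note)

    # 2. Retrieve source-partitioned evidence: each expert sees only its own source.
    evidence_a = [doc for q in sub_queries for doc in retrieve_a(q)]
    evidence_b = [doc for q in sub_queries for doc in retrieve_b(q)]

    # 3. Independent expert analyses.
    verdict_a = expert_a(note, evidence_a)
    verdict_b = expert_b(note, evidence_b)

    # 4. Debate and adjudicate only when the experts disagree.
    if verdict_a.has_error == verdict_b.has_error:
        verdict = verdict_a
    else:
        verdict = debate(note, verdict_a, verdict_b, evidence_a + evidence_b)

    # 5. The safety layer gets the final say on likely false positives.
    return safety_filter(note, verdict)
```

Passing the components in as callables keeps the sketch independent of any particular backbone model or retrieval stack, so the debate path can be exercised with simple stubs.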