Background: Cognitive biases in clinical decision-making contribute substantially to diagnostic errors and suboptimal patient outcomes, and addressing them remains a formidable challenge in medicine. Objective: This study explores the role of large language models (LLMs) in mitigating these biases through a multi-agent framework. We simulate the clinical decision-making process through multi-agent conversation and evaluate its efficacy in improving diagnostic accuracy. Methods: A total of 16 published and unpublished case reports in which cognitive biases had led to misdiagnosis were identified from the literature. In the multi-agent framework, we leveraged GPT-4 to facilitate interactions among four simulated agents that replicate clinical team dynamics. Each agent has a distinct role: 1) a decision maker who issues the final diagnosis after considering the discussion; 2) a devil's advocate who counters confirmation and anchoring bias; 3) a tutor and facilitator who guides the discussion to reduce premature closure bias; and 4) a recorder who documents and summarizes the findings. A total of 80 simulations were evaluated for the accuracy of the initial diagnosis, the top differential diagnosis, and the final two differential diagnoses. Results: Across the 80 responses evaluating both initial and final diagnoses, the initial diagnosis had an accuracy of 0% (0/80); following multi-agent discussion, accuracy rose to 71.3% (57/80) for the top differential diagnosis and to 80.0% (64/80) for the final two differential diagnoses. Conclusions: The framework demonstrated an ability to re-evaluate and correct misconceptions, even in scenarios with misleading initial investigations. The LLM-driven multi-agent conversation framework shows promise for enhancing diagnostic accuracy in diagnostically challenging medical scenarios.
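The four-role discussion loop described in the Methods can be sketched as a generic orchestration function. This is a minimal illustration only, assuming an arbitrary chat-completion callable (`ask`); the role prompts, round count, and helper names here are hypothetical and are not the study's actual prompts or code.

```python
# Minimal sketch of a four-agent diagnostic discussion loop.
# ROLE prompts below are illustrative placeholders, not the study's prompts.
ROLES = {
    "decision_maker": "Make the final diagnosis after considering the discussion.",
    "devils_advocate": "Challenge the leading diagnosis to counter confirmation and anchoring bias.",
    "facilitator": "Guide the discussion and discourage premature closure.",
    "recorder": "Record and summarize the findings of each round.",
}

def run_simulation(case_report, ask, rounds=3):
    """Run a multi-agent discussion over `case_report`.

    `ask(role_prompt, transcript)` is any chat-completion callable
    (e.g. a thin wrapper around a GPT-4 API call) returning the
    agent's reply as a string.
    """
    transcript = [f"Case: {case_report}"]
    for _ in range(rounds):
        for role, prompt in ROLES.items():
            if role == "decision_maker":
                continue  # the decision maker speaks only once, at the end
            reply = ask(prompt, transcript)
            transcript.append(f"{role}: {reply}")
    final = ask(ROLES["decision_maker"], transcript)
    transcript.append(f"decision_maker: {final}")
    return final, transcript
```

In this shape, accuracy evaluation amounts to running `run_simulation` once per case report and comparing the decision maker's output against the case's established diagnosis.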