Large language models (LLMs) have the potential to transform medicine, but real-world clinical scenarios contain extraneous information that can hinder performance. The rise of assistive technologies such as ambient dictation, which automatically generates draft notes from live patient encounters, may introduce additional noise, making it crucial to assess the ability of LLMs to filter relevant data. To investigate this, we developed MedDistractQA, a benchmark of USMLE-style questions embedded with simulated real-world distractions. Our findings show that distracting statements (polysemous words with clinical meanings used in a non-clinical context, or references to unrelated health conditions) can reduce LLM accuracy by up to 17.9%. Commonly proposed solutions for improving model performance, such as retrieval-augmented generation (RAG) and medical fine-tuning, did not mitigate this effect and in some cases introduced their own confounders, further degrading performance. These results suggest that LLMs natively lack the logical mechanisms necessary to distinguish relevant from irrelevant clinical information, posing challenges for real-world applications. MedDistractQA and our results highlight the need for robust mitigation strategies to enhance LLM resilience to extraneous information.
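To make the evaluation protocol concrete, the following is a minimal sketch, not the paper's actual harness, of how one might measure the accuracy drop caused by embedding a distractor into each question. The `MCQItem`, `ask_model`, and `build_prompt` names, along with the example vignette and distractor, are illustrative assumptions; a real run would plug an LLM API into `ask_model` and iterate over the MedDistractQA items.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    stem: str                 # USMLE-style clinical vignette
    options: dict[str, str]   # option letter -> answer text
    answer: str               # correct option letter
    distractor: str           # clinically irrelevant statement to embed

def ask_model(prompt: str) -> str:
    """Placeholder for the LLM under test; swap in a real API call."""
    return "A"  # stub so the sketch runs end to end

def build_prompt(item: MCQItem, distract: bool) -> str:
    # Append the distracting statement to the vignette when requested.
    stem = f"{item.stem} {item.distractor}" if distract else item.stem
    choices = "\n".join(f"{k}. {v}" for k, v in sorted(item.options.items()))
    return f"{stem}\n{choices}\nAnswer with the option letter only."

def accuracy(items: list[MCQItem], distract: bool) -> float:
    hits = sum(
        ask_model(build_prompt(it, distract)).strip().upper().startswith(it.answer)
        for it in items
    )
    return hits / len(items)

if __name__ == "__main__":
    # Hypothetical item: the distractor uses a drug name in a non-clinical context.
    items = [
        MCQItem(
            stem="A 54-year-old man presents with crushing substernal chest pain.",
            options={"A": "Aortic dissection", "B": "Myocardial infarction"},
            answer="B",
            distractor="His nephew recently adopted a cat named Lipitor.",
        )
    ]
    clean = accuracy(items, distract=False)
    noisy = accuracy(items, distract=True)
    print(f"clean={clean:.2f} distracted={noisy:.2f} drop={clean - noisy:.2f}")
```

The key design point is that the clean and distracted runs share identical questions and answer options, so any difference in accuracy is attributable to the injected distractor alone.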