Large language models (LLMs) are increasingly used in healthcare, but their reliability is heavily influenced by user-driven factors such as question phrasing and the completeness of clinical information. In this study, we examined how misinformation framing, source authority, model persona, and omission of key clinical details affect the diagnostic accuracy and reliability of LLM outputs. We conducted two experiments: one introducing misleading external opinions with varying assertiveness (perturbation test), and another removing specific categories of patient information (ablation test). Using public datasets (MedQA and Medbullets), we evaluated proprietary models (GPT-4o, Claude 3.5 Sonnet, Claude 3.5 Haiku, Gemini 1.5 Pro, Gemini 1.5 Flash) and open-source models (LLaMA 3 8B, LLaMA 3 Med42 8B, DeepSeek R1 8B). All models were vulnerable to user-driven misinformation, with proprietary models especially affected by definitive and authoritative language; an assertive tone had the greatest negative impact on accuracy. In the ablation test, omitting physical exam findings and lab results caused the largest performance drop. Although proprietary models had higher baseline accuracy, their performance declined sharply under misinformation. These results highlight the need for well-structured prompts and complete clinical context. Users should avoid framing unverified opinions in authoritative terms and should provide full clinical details, especially for complex cases.
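As a rough illustration of the perturbation test described above (a sketch, not the authors' implementation), a misleading opinion pointing at a wrong answer option could be appended to a MedQA-style item with graded assertiveness, as in the following. All names, template wordings, and assertiveness levels here are hypothetical assumptions; the actual study also varied source authority and model persona, which this sketch omits.

```python
# Illustrative sketch only: hypothetical templates for injecting a misleading
# external opinion at different assertiveness levels into a MedQA-style item.
# None of these names or wordings come from the paper itself.

ASSERTIVENESS_TEMPLATES = {
    "tentative":  "A colleague wonders whether the answer might be {wrong}.",
    "confident":  "A colleague believes the answer is {wrong}.",
    "definitive": "A senior attending physician states that the answer is definitely {wrong}.",
}

def build_perturbed_prompt(question: str, options: dict[str, str],
                           wrong_option: str, assertiveness: str) -> str:
    """Append a misleading opinion (pointing at a wrong option) to the question."""
    opinion = ASSERTIVENESS_TEMPLATES[assertiveness].format(
        wrong=f"({wrong_option}) {options[wrong_option]}"
    )
    choices = "\n".join(f"({k}) {v}" for k, v in options.items())
    return (
        f"{question}\n\n{choices}\n\n"
        f"{opinion}\n"
        "Answer with the letter of the single best option."
    )

if __name__ == "__main__":
    demo_options = {"A": "Aspirin", "B": "Heparin", "C": "Warfarin", "D": "Clopidogrel"}
    print(build_perturbed_prompt(
        "A 62-year-old man presents with acute chest pain. What is the next best step?",
        demo_options, wrong_option="C", assertiveness="definitive",
    ))
```

Model accuracy would then be compared between the unperturbed and perturbed prompts for each assertiveness level; the ablation test works analogously by deleting a category of patient information (e.g., physical exam findings or lab results) from the vignette instead of adding an opinion.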