Susceptibility of Large Language Models to User-Driven Factors in Medical Queries

Large language models (LLMs) are increasingly used in healthcare, but their reliability is heavily influenced by user-driven factors such as question phrasing and the completeness of clinical information. In this study, we examined how misinformation framing, source authority, model persona, and omission of key clinical details affect the diagnostic accuracy and reliability of LLM outputs. We conducted two experiments: one introducing misleading external opinions with varying assertiveness (perturbation test), and another removing specific categories of patient information (ablation test). Using public datasets (MedQA and Medbullets), we evaluated proprietary models (GPT-4o, Claude 3.5 Sonnet, Claude 3.5 Haiku, Gemini 1.5 Pro, Gemini 1.5 Flash) and open-source models (LLaMA 3 8B, LLaMA 3 Med42 8B, DeepSeek R1 8B). All models were vulnerable to user-driven misinformation, with proprietary models especially affected by definitive and authoritative language. Assertive tone had the greatest negative impact on accuracy. In the ablation test, omitting physical exam findings and lab results caused the most significant performance drop. Although proprietary models had higher baseline accuracy, their performance declined sharply under misinformation. These results highlight the need for well-structured prompts and complete clinical context. Users should avoid authoritative framing of misinformation and provide full clinical details, especially for complex cases.
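The two experiments described above can be sketched as prompt-construction helpers. This is only an illustrative sketch, not the authors' code: the `ClinicalCase` structure, the section names, the template wording, and the toy case are all assumptions made for the example, and the abstract does not specify how the prompts were actually built.

```python
from dataclasses import dataclass

# Hypothetical wording for two assertiveness levels (an assumption; the
# study's actual phrasing is not given in the abstract).
ASSERTIVENESS_TEMPLATES = {
    "tentative": "{source} thinks the answer might be {option}.",
    "definitive": "{source} is certain the answer is {option}.",
}


@dataclass
class ClinicalCase:
    sections: dict  # category name -> text, e.g. "history", "physical_exam", "labs"
    question: str
    options: dict   # option letter -> option text
    correct: str    # letter of the correct option


def render_prompt(case, omit=(), hint=None):
    """Join the retained sections, the question, the options, and an optional hint."""
    parts = [text for name, text in case.sections.items() if name not in omit]
    parts.append(case.question)
    parts.extend(f"{letter}. {text}" for letter, text in sorted(case.options.items()))
    if hint:
        parts.append(hint)
    return "\n".join(parts)


def perturbation_prompt(case, wrong_option, source, tone):
    """Perturbation test: append a misleading external opinion with a given tone."""
    if wrong_option == case.correct:
        raise ValueError("the misleading opinion must point at an incorrect option")
    hint = ASSERTIVENESS_TEMPLATES[tone].format(source=source, option=wrong_option)
    return render_prompt(case, hint=hint)


def ablation_prompt(case, category):
    """Ablation test: remove one category of patient information."""
    if category not in case.sections:
        raise KeyError(category)
    return render_prompt(case, omit=(category,))


def accuracy(predictions, cases):
    """Fraction of predicted option letters that match the correct option."""
    return sum(p == c.correct for p, c in zip(predictions, cases)) / len(cases)


# Toy case invented purely for illustration; it is not taken from MedQA or Medbullets.
toy_case = ClinicalCase(
    sections={
        "history": "History: adult patient with fever and cough for three days.",
        "physical_exam": "Exam: crackles over the right lower lung field.",
        "labs": "Labs: elevated white blood cell count.",
    },
    question="Which diagnosis is most likely?",
    options={"A": "Pneumonia", "B": "Asthma"},
    correct="A",
)

misled = perturbation_prompt(toy_case, "B", "A senior physician", "definitive")
no_labs = ablation_prompt(toy_case, "labs")
```

Comparing `accuracy` on the unmodified prompts against the perturbed or ablated prompts would give the kind of per-condition performance drop the abstract reports; the model call itself is deliberately left out because it depends on the provider being evaluated.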

