Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions

Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with patient details. In this paper, we first have three hematology-oncology physicians evaluate cancer-related questions drawn from real patients. While LLM responses are generally accurate, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM -- including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet -- corrects these false presuppositions more than $43\%$ of the time. To study mitigation strategies, we further construct a 150-question Cancer-Myth-NFP set, in which physicians confirm the absence of false presuppositions. We find that typical mitigation strategies, such as adding precautionary prompts with GEPA optimization, can raise accuracy on Cancer-Myth to $80\%$, but at the cost of misidentifying presuppositions in $41\%$ of Cancer-Myth-NFP questions and causing a $10\%$ relative performance drop on other medical benchmarks. These findings highlight a critical gap in the reliability of LLMs, show that prompting alone is not a reliable remedy for false presuppositions, and underscore the need for more robust safeguards in medical AI systems.
