Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions that include patient-specific details. In this paper, we first have three hematology-oncology physicians evaluate cancer-related questions drawn from real patients. While LLM responses are generally accurate, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM -- including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet -- corrects these false presuppositions more than $43\%$ of the time. To study mitigation strategies, we further construct a 150-question Cancer-Myth-NFP set, in which physicians confirm the absence of false presuppositions. We find that typical mitigation strategies, such as adding precautionary prompts optimized with GEPA, can raise accuracy on Cancer-Myth to $80\%$, but at the cost of misidentifying presuppositions in $41\%$ of Cancer-Myth-NFP questions and causing a $10\%$ relative performance drop on other medical benchmarks. These findings highlight a critical gap in the reliability of LLMs, show that prompting alone is not a reliable remedy for false presuppositions, and underscore the need for more robust safeguards in medical AI systems.