Large language models are increasingly being used in patient-facing medical question answering, where hallucinated outputs can vary widely in potential harm. However, existing hallucination standards and evaluation metrics focus primarily on factual correctness, treating all errors as equally severe. This obscures clinically relevant failure modes, particularly when models generate unsupported but actionable medical language. We propose a risk-sensitive evaluation framework that quantifies hallucinations through the presence of risk-bearing language, including treatment directives, contraindications, urgency cues, and mentions of high-risk medications. Rather than assessing clinical correctness, our approach evaluates the potential impact of hallucinated content if acted upon. We further combine risk scoring with a relevance measure to identify high-risk, low-grounding failures. We apply this framework to three instruction-tuned language models using controlled patient-facing prompts designed as safety stress tests. Our results show that models with similar surface-level behavior exhibit substantially different risk profiles and that standard evaluation metrics fail to capture these distinctions. These findings highlight the importance of incorporating risk sensitivity into hallucination evaluation and suggest that evaluation validity is critically dependent on task and prompt design.
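The combination of risk scoring and relevance described above can be sketched as follows. This is a minimal illustrative example, not the paper's actual method: the cue lexicons, category weights, lexical-overlap relevance proxy, and thresholds are all hypothetical assumptions introduced here for clarity.

```python
# Hypothetical sketch of risk-sensitive hallucination scoring.
# Keyword lists, weights, and thresholds are illustrative assumptions,
# not the framework's actual lexicons or parameters.

RISK_CUES = {
    "treatment_directive": (["take", "stop taking", "increase the dose"], 3.0),
    "contraindication": (["do not combine", "contraindicated"], 3.0),
    "urgency": (["immediately", "emergency", "call 911"], 2.0),
    "high_risk_medication": (["warfarin", "insulin", "opioid"], 2.5),
}

def risk_score(answer: str) -> float:
    """Sum the weights of risk-bearing cue categories present in the answer."""
    text = answer.lower()
    return sum(
        weight
        for phrases, weight in RISK_CUES.values()
        if any(p in text for p in phrases)
    )

def relevance(answer: str, question: str) -> float:
    """Crude lexical-overlap proxy for how grounded the answer is in the question."""
    a = set(answer.lower().split())
    q = set(question.lower().split())
    return len(a & q) / max(len(q), 1)

def flag_high_risk_low_grounding(
    answer: str, question: str, risk_thr: float = 2.5, rel_thr: float = 0.2
) -> bool:
    """Flag answers that carry substantial risk language but little grounding."""
    return risk_score(answer) >= risk_thr and relevance(answer, question) < rel_thr
```

Note that each cue category contributes its weight at most once, so the score reflects the breadth of risk-bearing language rather than raw keyword counts; a real implementation would also need clinically validated lexicons and a stronger grounding measure than lexical overlap.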