Large language models (LLMs) are increasingly applied in clinical decision support, yet current evaluations rarely reveal whether their outputs reflect genuine medical reasoning or superficial correlations. We introduce DeVisE (Demographics and Vital signs Evaluation), a behavioral testing framework that probes fine-grained clinical understanding through controlled counterfactuals. Using intensive care unit (ICU) discharge notes from MIMIC-IV, we construct both raw (real-world) and template-based (synthetic) variants with single-variable perturbations in demographic (age, gender, ethnicity) and vital-sign attributes. We evaluate eight LLMs, spanning general-purpose and medical variants, in a zero-shot setting. Model behavior is analyzed through (1) input-level sensitivity, capturing how counterfactuals alter perplexity, and (2) downstream reasoning, measuring their effect on predicted ICU length-of-stay and mortality. Overall, our results show that standard task metrics obscure clinically relevant differences in model behavior, with models differing substantially in how consistently and proportionally they adjust predictions to counterfactual perturbations.
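The single-variable perturbation scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the template text, attribute names, and values are hypothetical, and a real pipeline would operate on MIMIC-IV discharge notes rather than a toy template.

```python
# Hypothetical sketch of a single-variable counterfactual perturbation
# on a templated ICU note: one attribute is varied at a time while all
# other attributes are held fixed. Field names and values are illustrative.

TEMPLATE = (
    "Patient is a {age}-year-old {gender} admitted to the ICU. "
    "Heart rate on admission: {heart_rate} bpm."
)

def make_counterfactuals(base, attribute, values):
    """Return (value, note) pairs that differ from `base` only in `attribute`."""
    notes = []
    for v in values:
        fields = dict(base, **{attribute: v})  # override exactly one field
        notes.append((v, TEMPLATE.format(**fields)))
    return notes

base = {"age": 67, "gender": "female", "heart_rate": 88}
for value, note in make_counterfactuals(base, "gender", ["female", "male"]):
    print(value, "->", note)
```

Each resulting pair of notes is then compared, e.g. by scoring perplexity under the model or by eliciting length-of-stay and mortality predictions, so that any difference in model behavior is attributable to the single perturbed attribute.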