Improving Fairness of Large Language Model-Based ICU Mortality Prediction via Case-Based Prompting

Accurately predicting mortality risk in intensive care unit (ICU) patients is essential for clinical decision-making. Although large language models (LLMs) show strong potential in structured medical prediction tasks, their outputs may exhibit biases related to demographic attributes such as sex, age, and race, limiting their reliability in fairness-critical clinical settings. Existing debiasing methods often degrade predictive performance, making it difficult to balance fairness and accuracy. In this study, we systematically analyze fairness issues in LLM-based ICU mortality prediction and propose a clinically adaptive prompting framework that improves both performance and fairness without model retraining. We first design a multi-dimensional bias assessment scheme to identify subgroup disparities. Based on this, we introduce CAse Prompting (CAP), a training-free framework that integrates existing debiasing strategies and further guides the model with similar historical cases that were mispredicted, each paired with its correct outcome, to correct biased reasoning. We evaluate CAP on the MIMIC-IV dataset. AUROC improves from 0.806 to 0.873 and AUPRC from 0.497 to 0.694. Meanwhile, prediction disparities are substantially reduced across demographic groups, with reductions exceeding 90% for sex and for certain White-Black comparisons. Feature reliance analysis further reveals highly consistent attention patterns across groups, with similarity above 0.98. These findings demonstrate that fairness and performance in LLM-based clinical prediction can be jointly optimized through carefully designed prompting, offering a practical paradigm for developing reliable and equitable clinical decision-support systems.
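The case-based prompting idea and the subgroup disparity check described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the retrieval method, the prompt wording, or the disparity metric, so the cosine-similarity retrieval, the prompt template, the mean-risk gap, and every name below (`retrieve_cases`, `build_prompt`, `subgroup_gap`, the case fields) are assumptions chosen for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_cases(query_features, case_bank, k=2):
    """Return the k past mispredicted cases most similar to the query.

    Assumption: each case is a dict with hypothetical keys
    'features', 'summary', 'predicted', and 'actual'.
    """
    ranked = sorted(
        case_bank,
        key=lambda case: cosine_similarity(query_features, case["features"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(patient_summary, cases):
    """Assemble a prompt that shows past errors with their true outcomes.

    The wording is a hypothetical template, not the one used in the paper.
    """
    lines = ["Past cases where an earlier prediction was wrong:"]
    for i, case in enumerate(cases, start=1):
        lines.append(
            f"Case {i}: {case['summary']} | earlier prediction: "
            f"{case['predicted']} | actual outcome: {case['actual']}"
        )
    lines.append(f"New patient: {patient_summary}")
    lines.append("Estimate in-hospital mortality risk as a probability in [0, 1].")
    return "\n".join(lines)


def subgroup_gap(predicted_risks, groups, group_a, group_b):
    """Absolute difference in mean predicted risk between two subgroups.

    Assumption: one simple disparity measure among many possible ones.
    """
    def mean_risk(name):
        values = [r for r, g in zip(predicted_risks, groups) if g == name]
        return sum(values) / len(values)

    return abs(mean_risk(group_a) - mean_risk(group_b))


# Toy data, invented for illustration; not drawn from MIMIC-IV.
case_bank = [
    {"features": [1.0, 0.0], "summary": "elderly, high lactate",
     "predicted": "survived", "actual": "died"},
    {"features": [0.0, 1.0], "summary": "young, stable vitals",
     "predicted": "died", "actual": "survived"},
    {"features": [0.9, 0.1], "summary": "elderly, moderate lactate",
     "predicted": "survived", "actual": "died"},
]
similar = retrieve_cases([1.0, 0.0], case_bank, k=2)
prompt = build_prompt("elderly, rising lactate", similar)
gap = subgroup_gap([0.2, 0.4, 0.6, 0.8], ["F", "F", "M", "M"], "F", "M")
```

In this sketch the prompt string would be sent to an LLM in place of a plain zero-shot query, and `subgroup_gap` would be computed on the model's predictions before and after adding the retrieved cases to see whether the disparity narrows.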
