Evaluating Large Language Models for Fair and Reliable Organ Allocation

Medical institutions are considering the use of LLMs in high-stakes clinical decision-making, such as organ allocation. In such sensitive use cases, evaluating fairness is imperative. However, existing evaluation methods often fall short: benchmarks are too simplistic to capture real-world complexity, and accuracy-based metrics fail to address the absence of a clear ground truth. To model organ allocation, specifically kidney allocation, realistically and fairly, we begin by testing the medical knowledge of LLMs to determine whether they understand the clinical factors required to make sound allocation decisions. Building on this foundation, we design two tasks: (1) Choose-One and (2) Rank-All. In Choose-One, LLMs select a single candidate from a list of potential candidates to receive a kidney. In this scenario, we assess fairness across demographics using traditional fairness metrics, such as proportional parity. In Rank-All, LLMs rank all candidates waiting for a kidney, which reflects real-world allocation more closely, since an organ is passed down a ranked list until it is allocated. Our evaluation of three LLMs reveals a divergence between fairness metrics: while exposure-based metrics suggest equitable outcomes, probability-based metrics uncover systematic preferential sorting, in which specific groups are clustered in the upper ranking tiers. Furthermore, we observe that demographic preferences are highly task-dependent, showing inverted trends between the Choose-One and Rank-All tasks, even when only the topmost rank is considered. Overall, our results indicate that current LLMs can introduce inequalities in real-world allocation scenarios, underscoring the urgent need for rigorous fairness evaluation and human oversight before their use in high-stakes decision-making.
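The abstract does not give formulas for its metrics, so the following is only a minimal sketch of how the three kinds of measurement it names might be computed. Everything here is an assumption for illustration: the function names, the input layout (lists of group labels), the definition of proportional parity as selection share divided by pool share, the logarithmic position discount for exposure, and the top-k inclusion rate as a stand-in for a probability-based metric.

```python
import math
from collections import Counter, defaultdict


def proportional_parity(trials):
    """Choose-One: ratio of each group's share of selections to its share
    of the candidate pool (1.0 means proportional; assumed definition).

    trials: list of (groups, chosen_index), where groups is the list of
    group labels of the candidates shown in that trial.
    """
    present, selected = Counter(), Counter()
    for groups, chosen_index in trials:
        present.update(groups)
        selected[groups[chosen_index]] += 1
    total_present = sum(present.values())
    total_selected = sum(selected.values())
    return {
        g: (selected[g] / total_selected) / (present[g] / total_present)
        for g in present
    }


def mean_exposure(rankings):
    """Rank-All: average position-discounted exposure per group, using the
    common 1 / log2(1 + rank) discount (assumed, not from the source).

    rankings: list of rankings, each a list of group labels ordered from
    rank 1 downward.
    """
    total, count = defaultdict(float), Counter()
    for ranking in rankings:
        for rank, g in enumerate(ranking, start=1):
            total[g] += 1.0 / math.log2(rank + 1)
            count[g] += 1
    return {g: total[g] / count[g] for g in count}


def top_k_rate(rankings, k):
    """Rank-All: fraction of each group's candidates placed in the top k
    positions (a simple probability-style view of tier clustering)."""
    in_top, count = Counter(), Counter()
    for ranking in rankings:
        for position, g in enumerate(ranking):
            count[g] += 1
            if position < k:
                in_top[g] += 1
    return {g: in_top[g] / count[g] for g in count}


if __name__ == "__main__":
    # Hypothetical toy data with two groups, "A" and "B".
    pool = ["A", "B", "A", "B"]
    trials = [(pool, 0), (pool, 2), (pool, 1), (pool, 0)]
    rankings = [["A", "B", "B", "A"]]
    print(proportional_parity(trials))
    print(mean_exposure(rankings))
    print(top_k_rate(rankings, k=1))
```

Averaged exposure and top-k rates summarize a ranking differently, so two groups can look similar on one and differ on the other, which is the kind of divergence between metric families that the abstract reports.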
