Medicine is inherently pluralistic. Principles such as autonomy, beneficence, nonmaleficence, and justice routinely conflict, and the resulting ethical dilemmas often sharply divide reasonable physicians. Good clinical practice navigates these tensions in concert with each patient's values rather than imposing a single ethical stance. The ethical values that large language models bring to medical advice, however, have not been systematically examined. We present a framework for auditing value pluralism in medical AI, comprising a benchmark of clinician-verified dilemmas and an attribution method that recovers value priorities directly from decisions. We find that the ecosystem of frontier models, taken as a whole, spans physician-level value heterogeneity, and that models discuss competing values in their reasoning (Overton pluralism) before committing to a decision. However, each individual model's decisions are near-deterministic across repeated sampling and semantic variations, and so fail to reproduce the distributional pluralism of the physician panel. Across benchmark cases, these consistent decisions reflect committed, systematic value preferences. While most model priorities fall within the natural range of inter-physician variation, some models significantly underweight patient autonomy. A single LLM deployed without regard for its value priorities could amplify those priorities at scale across every patient it serves. Without explicit efforts to balance ethical perspectives, whether within one model or across several, these tools risk replacing clinical pluralism with a deployment monoculture.
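The contrast between near-deterministic model decisions and the distributional pluralism of a physician panel can be illustrated with a minimal sketch. It assumes each case has a categorical decision and uses Shannon entropy as the measure of dispersion. The function name `decision_entropy` and the toy votes are hypothetical illustrations; they are not drawn from the benchmark, and this is not the attribution method described above.

```python
import math
from collections import Counter


def decision_entropy(decisions):
    """Return the Shannon entropy, in bits, of a list of categorical decisions.

    0.0 means every decision was identical (fully deterministic);
    higher values mean the decisions are spread across more options.
    """
    counts = Counter(decisions)
    total = len(decisions)
    # Sum p * log2(1/p) over the observed options.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical toy data for a single dilemma (not real benchmark results):
# a physician panel that splits evenly, and one model sampled repeatedly.
physician_votes = ["A", "B", "A", "B", "A", "B", "A", "B"]
model_samples = ["A"] * 8

panel_entropy = decision_entropy(physician_votes)
model_entropy = decision_entropy(model_samples)

print(f"panel entropy: {panel_entropy:.3f} bits")
print(f"model entropy: {model_entropy:.3f} bits")
```

In this toy case the panel's even split gives the maximum entropy for a binary choice, while the repeated model samples give zero. Entropy is only one possible dispersion measure; a divergence between the two decision distributions would be an equally reasonable choice under the same assumptions.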