NurValues: Real-World Nursing Values Evaluation for Large Language Models in Clinical Context

While large language models (LLMs) have demonstrated medical knowledge and conversational ability, their deployment in clinical practice raises new risks: patients may place greater trust in LLM-generated responses than in nurses' professional judgments, potentially intensifying nurse-patient conflicts. Such risks highlight the urgent need to evaluate whether LLMs align with the core nursing values upheld by human nurses. This work introduces the first benchmark for nursing value alignment, consisting of five core value dimensions distilled from international nursing codes: Altruism, Human Dignity, Integrity, Justice, and Professionalism. We define tasks at two levels on the benchmark, reflecting two characteristics of emerging nurse-patient conflicts. The Easy-Level dataset consists of 2,200 value-aligned and value-violating instances collected through a five-month longitudinal field study across three hospitals of different tiers. The Hard-Level dataset comprises 2,200 dialogue-based instances that embed contextual cues and subtle misleading signals, which increase adversarial complexity and better reflect the subjectivity and bias of narrators in emerging nurse-patient conflicts. We evaluate 23 state-of-the-art (SoTA) LLMs on their ability to align with nursing values and find that general-purpose LLMs outperform medical ones and that Justice is the hardest value dimension. As the first real-world benchmark for healthcare value alignment, NurValues provides novel insights into how LLMs navigate ethical challenges in clinician-patient interactions.
