Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity, with Flesch-Kincaid Grade Level (FKGL) scores of up to 16.91-17.60 vs. 11.47-12.50 for physician-authored responses. Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (by up to 6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment: Rephrase configurations achieve the highest semantic similarity to physician answers (mean similarity up to 0.93) while consistently improving readability and reducing affective extremity. Dual-stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than as replacements for clinical expertise.
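For readers unfamiliar with the readability metric reported above, the following is a minimal sketch of how a Flesch-Kincaid Grade Level score can be computed. The formula coefficients are the standard ones; the syllable counter (`count_syllables`) and the regex-based sentence and word splitting are rough illustrative assumptions, not the tooling used in this study, so scores from this sketch may differ slightly from the reported values.

```python
import re


def count_syllables(word: str) -> int:
    """Heuristic syllable count (an assumption for illustration):
    count groups of consecutive vowels and drop a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    # Naive splitting on terminal punctuation; abbreviations are not handled.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    plain = "The cat sat."
    clinical = "Hypertension necessitates pharmacological intervention."
    print(f"plain:    {fkgl(plain):.2f}")
    print(f"clinical: {fkgl(clinical):.2f}")
```

Under this heuristic, short sentences built from monosyllabic words score far lower than dense clinical phrasing, which is the direction of the gap between physician-authored and baseline model responses described above.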