Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook whether responses preserve comparable medical information when users phrase the same question differently. To address this, we introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark that assesses whether LLMs provide comparable medical information across user-side language, register, and health-literacy signals. MIRA contains 4,320 prompts built from 60 medically reviewed, low-risk health questions. All five mainstream LLMs we evaluated answered every medical question, but responses to prompts carrying low-health-literacy signals consistently omitted more key information, provided fewer concrete next steps, and offered less support for independent judgment. We term this pattern Differential Information Dilution (DID). Language effects are model-specific rather than uniformly worse for non-English prompts. A comparison with 300 real-world health queries provides preliminary evidence of rank-order validity. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).