Large language models (LLMs) are increasingly used to provide health advice, yet evidence on how their accuracy varies across languages, topics and information sources remains limited. We assess how linguistic and contextual factors affect the accuracy of AI-based health-claim verification. We evaluated seven widely used LLMs on two datasets: (i) 1,975 legally authorised nutrition and health claims from UK and EU regulatory registers, translated into 21 languages; and (ii) 9,088 journalist-vetted public-health claims from the PUBHEALTH corpus spanning COVID-19, abortion, politics and general health, drawn from government advisories, scientific abstracts and media sources. Models classified each claim as supported or unsupported using majority voting over repeated runs. Accuracy was analysed by language, topic, source and model. Accuracy on authorised claims was highest in English and closely related European languages and declined in several widely spoken non-European languages, with accuracy falling as syntactic distance from English increased. On real-world public-health claims, accuracy was substantially lower and varied systematically by topic and source. Models performed best on COVID-19 and government-attributed claims and worst on general health and scientific abstracts. High performance on English-language, canonical health claims masks substantial context-dependent gaps. Differences in training-data exposure, editorial framing and topic-specific tuning likely contribute to these disparities, which are comparable in magnitude to cross-language differences. LLM accuracy in health-claim verification depends strongly on language, topic and information source. English-language performance does not reliably generalise across contexts, underscoring the need for multilingual, domain-specific evaluation before deployment in public-health communication.
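As a rough illustration of the majority-voting protocol described above, the following minimal Python sketch classifies a claim by taking the most common verdict across repeated model runs. The `classify_claim` helper, model name, claim text and run count are hypothetical placeholders, not the authors' implementation; a real setup would replace the stub with calls to the evaluated LLMs.

```python
import random  # stand-in only; a real run would query an LLM API
from collections import Counter


def classify_claim(model: str, claim: str) -> str:
    """Placeholder for a single model run returning a binary verdict."""
    # In practice this would prompt the model with the claim and parse
    # its answer into "supported" or "unsupported".
    return random.choice(["supported", "unsupported"])


def majority_vote(model: str, claim: str, n_runs: int = 5) -> str:
    """Classify one claim by majority vote over repeated runs of one model."""
    verdicts = [classify_claim(model, claim) for _ in range(n_runs)]
    return Counter(verdicts).most_common(1)[0][0]


if __name__ == "__main__":
    example_claim = "Vitamin C contributes to the normal function of the immune system."
    print(majority_vote("example-model", example_claim))
```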