Large language models (LLMs) are increasingly used to provide health advice, yet evidence on how their accuracy varies across languages, topics and information sources remains limited. We assess how linguistic and contextual factors affect the accuracy of AI-based health-claim verification. We evaluated seven widely used LLMs on two datasets: (i) 1,975 legally authorised nutrition and health claims from UK and EU regulatory registers, translated into 21 languages; and (ii) 9,088 journalist-vetted public-health claims from the PUBHEALTH corpus spanning COVID-19, abortion, politics and general health, drawn from government advisories, scientific abstracts and media sources. Models classified each claim as supported or unsupported using majority voting over repeated runs. Accuracy was analysed by language, topic, source and model. Accuracy on authorised claims was highest in English and closely related European languages and declined in several widely spoken non-European languages, with accuracy falling as syntactic distance from English increased. On real-world public-health claims, accuracy was substantially lower and varied systematically by topic and source. Models performed best on COVID-19 and government-attributed claims and worst on general health and scientific abstracts. High performance on English-language, canonical health claims masks substantial context-dependent gaps. Differences in training-data exposure, editorial framing and topic-specific tuning likely contribute to these disparities, which are comparable in magnitude to cross-language differences. LLM accuracy in health-claim verification depends strongly on language, topic and information source. English-language performance does not reliably generalise across contexts, underscoring the need for multilingual, domain-specific evaluation before deployment in public-health communication.
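As a rough illustration of the majority-voting protocol described above, the following minimal Python sketch classifies a claim by taking the most common verdict across repeated model runs. The `classify_claim` helper, model name, claim text and run count are hypothetical placeholders, not the authors' implementation; a real setup would replace the stub with calls to the evaluated LLMs.

```python
import random  # stand-in only; a real run would query an LLM API
from collections import Counter


def classify_claim(model: str, claim: str) -> str:
    """Placeholder for a single model run returning a binary verdict."""
    # In practice this would prompt the model with the claim and parse
    # its answer into "supported" or "unsupported".
    return random.choice(["supported", "unsupported"])


def majority_vote(model: str, claim: str, n_runs: int = 5) -> str:
    """Classify one claim by majority vote over repeated runs of one model."""
    verdicts = [classify_claim(model, claim) for _ in range(n_runs)]
    return Counter(verdicts).most_common(1)[0][0]


if __name__ == "__main__":
    example_claim = "Vitamin C contributes to the normal function of the immune system."
    print(majority_vote("example-model", example_claim))
```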