Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries

Large language models (LLMs) are transforming the ways the general public accesses and consumes information. Their influence is particularly pronounced in pivotal sectors like healthcare, where lay individuals are increasingly adopting LLMs as conversational agents for everyday queries. While LLMs demonstrate impressive language understanding and generation proficiencies, concerns regarding their safety remain paramount in these high-stakes domains. Moreover, the development of LLMs is disproportionately focused on English. It remains unclear how these LLMs perform in non-English languages, a gap that is critical to address to ensure equity in the real-world use of these systems. This paper provides a framework to investigate the effectiveness of LLMs as multilingual dialogue systems for healthcare queries. Our empirically derived framework, XlingEval, focuses on three fundamental criteria for evaluating LLM responses to naturalistic, human-authored health-related questions: correctness, consistency, and verifiability. Through extensive experiments on four major global languages (English, Spanish, Chinese, and Hindi), spanning three expert-annotated large health Q&A datasets, and through a combination of algorithmic and human-evaluation strategies, we found a pronounced disparity in LLM responses across these languages, indicating a need for enhanced cross-lingual capabilities. We further propose XlingHealth, a cross-lingual benchmark for examining the multilingual capabilities of LLMs in the healthcare context. Our findings underscore the pressing need to bolster the cross-lingual capacities of these models and to provide an equitable information ecosystem accessible to all.
