In recent years, Large Language Models (LLMs) have become widely used in medical applications such as clinical decision support, medical education, and medical question answering. Yet these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted performance discrepancies on low-resource languages across various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that widens with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis suggests that model-reported confidence and explanations exhibit limited correlation with correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.