While large language models (LLMs) have been shown to perform well on monolingual mathematical and commonsense reasoning, they remain unreliable for multilingual medical reasoning, hindering their deployment in multilingual healthcare settings. We address this by first introducing CUREMED-BENCH, a high-quality multilingual medical reasoning dataset of open-ended reasoning queries, each with a single verifiable answer, spanning thirteen languages, including underrepresented languages such as Amharic, Yoruba, and Swahili. Building on this dataset, we propose CURE-MED, a curriculum-informed reinforcement learning framework that combines code-switching-aware supervised fine-tuning with Group Relative Policy Optimization to jointly improve logical correctness and language stability. Across all thirteen languages, our approach consistently outperforms strong baselines and scales effectively, achieving 85.21% language consistency and 54.35% logical correctness at 7B parameters, and 94.96% language consistency and 70.04% logical correctness at 32B parameters. These results support reliable and equitable multilingual medical reasoning in LLMs. The code and dataset are available at https://cure-med.github.io/