Large Language Models (LLMs) with reasoning capabilities have recently demonstrated strong potential in medical Question Answering (QA). Existing approaches are largely English-focused and rely primarily on distillation from general-purpose LLMs, raising concerns about the reliability of their medical knowledge. In this work, we present a method to generate multilingual reasoning traces grounded in medical knowledge extracted from Wikipedia. Using retrieval-augmented generation over this medical content, we produce 500k traces in English, Italian, and Spanish. The traces solve medical questions drawn from MedQA and MedMCQA, which we extend to Italian and Spanish. We test our pipeline in both in-domain and out-of-domain settings across medical QA benchmarks, and demonstrate that our reasoning traces improve performance both through few-shot in-context learning and through supervised fine-tuning, yielding state-of-the-art results among 8B-parameter LLMs. We believe that these resources can support the development of more transparent clinical decision-support tools in multilingual settings. We release the full suite of resources: reasoning traces, translated QA datasets, Medical-Wikipedia, and fine-tuned models.