Large Language Models (LLMs) have demonstrated remarkable versatility in recent years, with potential applications in specialized domains such as healthcare and medicine. Although various open-source LLMs tailored for health contexts are available, adapting general-purpose LLMs to the medical domain still presents significant challenges. In this paper, we introduce BioMistral, an open-source LLM tailored for the biomedical domain, which uses Mistral as its foundation model and is further pre-trained on PubMed Central. We conduct a comprehensive evaluation of BioMistral on a benchmark comprising 10 established medical question-answering (QA) tasks in English. We also explore lightweight models obtained through quantization and model merging approaches. Our results demonstrate BioMistral's superior performance compared to existing open-source medical models and its competitive edge against proprietary counterparts. Finally, to address the limited availability of data beyond English and to assess the multilingual generalization of medical LLMs, we automatically translated this benchmark into 7 other languages and evaluated the models on the translated versions. This marks the first large-scale multilingual evaluation of LLMs in the medical domain. Datasets, multilingual evaluation benchmarks, scripts, and all the models obtained during our experiments are freely released.
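The abstract mentions model merging without saying which methods were used. As a rough illustration only, the sketch below assumes two common choices, linear interpolation and spherical linear interpolation (SLERP), applied to two models with identical parameter names. The `Weights` type, the function names, and the toy values are hypothetical and are not taken from the BioMistral release.

```python
import math
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened list of values.
Weights = Dict[str, List[float]]


def lerp(u: List[float], v: List[float], t: float) -> List[float]:
    """Linear interpolation between two vectors: (1 - t) * u + t * v."""
    return [(1.0 - t) * x + t * y for x, y in zip(u, v)]


def linear_merge(a: Weights, b: Weights, t: float) -> Weights:
    """Merge two models by interpolating each parameter tensor linearly."""
    if a.keys() != b.keys():
        raise ValueError("models must share the same parameter names")
    return {name: lerp(a[name], b[name], t) for name in a}


def slerp(u: List[float], v: List[float], t: float, eps: float = 1e-8) -> List[float]:
    """Spherical linear interpolation between two weight vectors."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u < eps or norm_v < eps:
        # A zero vector has no direction; fall back to linear interpolation.
        return lerp(u, v, t)
    cos_omega = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        # Vectors are (anti-)parallel; the spherical formula is ill-conditioned.
        return lerp(u, v, t)
    weight_u = math.sin((1.0 - t) * omega) / sin_omega
    weight_v = math.sin(t * omega) / sin_omega
    return [weight_u * x + weight_v * y for x, y in zip(u, v)]


def slerp_merge(a: Weights, b: Weights, t: float) -> Weights:
    """Merge two models by applying SLERP to each parameter tensor."""
    if a.keys() != b.keys():
        raise ValueError("models must share the same parameter names")
    return {name: slerp(a[name], b[name], t) for name in a}


if __name__ == "__main__":
    general = {"layer.weight": [1.0, 0.0]}  # toy stand-in for a general model
    medical = {"layer.weight": [0.0, 1.0]}  # toy stand-in for a domain model
    print(linear_merge(general, medical, 0.5))
    print(slerp_merge(general, medical, 0.5))
```

Here `t` controls how far the merged weights move from the first model toward the second. Unlike linear interpolation, SLERP follows the arc between the two weight directions, which is one reason it is often tried when merging a general and a domain-adapted checkpoint.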