Background. Biomedical language models should improve performance on biomedical text while retaining general language modeling fluency. For Mamba-based models, this trade-off has not been systematically studied across biomedical literature and clinical text. Methods. We developed BioMamba, a family of biomedical Mamba2 models at five scales, obtained by continued pretraining of publicly released Mamba2 checkpoints on a balanced 80%/10%/10% mixture of PubMed abstracts, the Colossal Clean Crawled Corpus (C4), and Wikipedia. The contribution is the adaptation recipe and the accompanying open-weight checkpoints. Results. At all five scales, BioMamba lowered PubMed perplexity, lowered Wikipedia-style held-out perplexity by 1.46 to 4.72 points, and left C4 perplexity essentially unchanged. On six out-of-domain multiple-choice benchmarks, BioMamba stayed within ±3 percentage points of Mamba2, with no systematic regression. After supervised fine-tuning (SFT), BioMamba+SFT matched or exceeded Mamba2+SFT on MIMIC-IV note completion and discharge summary generation at every evaluated scale, and improved PubMedQA accuracy at every scale. The strongest model (BioMamba-2.7B) reached a PubMed perplexity of 5.28 and accuracies of 90.24% on BioASQ and 73.00% on PubMedQA. Conclusions. A balanced domain-adaptive continued pretraining recipe strengthens Mamba2 language models on biomedical literature and clinical text while preserving general language modeling fluency.
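To make the two quantities in the abstract concrete, the following is a minimal sketch, not the authors' pipeline. It shows one way to draw training documents from an 80%/10%/10% source mixture and how perplexity follows from per-token log-probabilities. The mixture weights come from the abstract; the source labels, function names, and the sampling scheme itself are illustrative assumptions.

```python
import math
import random
from collections import Counter

# Mixture weights from the abstract; the keys are illustrative labels only.
MIXTURE = {"pubmed": 0.8, "c4": 0.1, "wikipedia": 0.1}


def sample_sources(n, seed=0):
    """Draw n source labels at random, weighted by MIXTURE (hypothetical helper)."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def perplexity(token_log_probs):
    """Perplexity: exp of the mean negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


if __name__ == "__main__":
    counts = Counter(sample_sources(10_000))
    # Empirical source shares; these should sit close to the mixture weights.
    print({name: counts[name] / 10_000 for name in MIXTURE})
    # A model that assigns probability 0.25 to every token has perplexity of about 4.
    print(perplexity([math.log(0.25)] * 8))
```

Because perplexity is the exponential of the mean negative log-likelihood, a lower value means the model assigns higher probability to the held-out text, which is why the reductions on PubMed and Wikipedia-style text count as improvements.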