Developing effective biomedical retrieval models is important for excelling at knowledge-intensive biomedical tasks, but remains challenging due to the lack of sufficient publicly annotated biomedical data and computational resources. We present BMRetriever, a series of dense retrievers that enhance biomedical retrieval through unsupervised pre-training on large biomedical corpora, followed by instruction fine-tuning on a combination of labeled datasets and synthetic pairs. Experiments on 5 biomedical tasks across 11 datasets verify BMRetriever's efficacy in a range of biomedical applications. BMRetriever also exhibits strong parameter efficiency: the 410M variant outperforms baselines up to 11.7 times larger, and the 2B variant matches the performance of models with over 5B parameters. The training data and model checkpoints are released at \url{https://huggingface.co/BMRetriever} to ensure transparency, reproducibility, and application to new domains.