Information retrieval (IR) is essential in biomedical knowledge acquisition and clinical decision support. While recent progress has shown that language model encoders achieve better semantic retrieval, training such models requires abundant query-article annotations that are difficult to obtain in biomedicine. As a result, most biomedical IR systems only conduct lexical matching. In response, we introduce BioCPT, a first-of-its-kind Contrastively Pre-trained Transformer model for zero-shot biomedical IR. To train BioCPT, we collected 255 million user click logs from PubMed, an unprecedented scale for this kind of data. With these data, we use contrastive learning to train a closely integrated retriever and re-ranker pair. Experimental results show that BioCPT sets a new state of the art on five biomedical IR tasks, outperforming various baselines, including much larger models such as the GPT-3-sized cpt-text-XL. BioCPT also generates better biomedical article and sentence representations for semantic evaluations. As such, BioCPT can be readily applied to various real-world biomedical IR tasks. The BioCPT API and code are publicly available at https://github.com/ncbi/BioCPT.
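To make the contrastive training idea concrete, the following is a minimal sketch of a contrastive loss over query-article pairs, followed by dense retrieval by dot product. It is an illustration under stated assumptions, not the BioCPT implementation: the use of in-batch negatives, dot-product similarity, the `temperature` parameter, and the function names `in_batch_contrastive_loss` and `retrieve` are all hypothetical choices made here. The abstract states only that contrastive learning is applied to click-log data.

```python
import numpy as np


def in_batch_contrastive_loss(query_emb, article_emb, temperature=1.0):
    """InfoNCE-style loss with in-batch negatives (an assumed setup).

    Row i of query_emb and row i of article_emb form a positive pair,
    e.g. a query and an article clicked for it. Every other article in
    the batch serves as a negative for query i.
    """
    # Similarity matrix of shape (batch, batch); entry [i, j] scores
    # query i against article j.
    scores = query_emb @ article_emb.T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The positives sit on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


def retrieve(query_vec, article_embs, k=3):
    """Return indices and scores of the k highest-scoring articles."""
    scores = article_embs @ query_vec
    top = np.argsort(-scores)[:k]
    return top, scores[top]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    articles = rng.normal(size=(8, 16))
    # Toy queries: noisy copies of the articles they should match.
    queries = articles + 0.1 * rng.normal(size=articles.shape)
    print("loss:", in_batch_contrastive_loss(queries, articles))
    print("top articles for query 0:", retrieve(queries[0], articles)[0])
```

In a two-stage system of the kind the abstract describes, a retriever like this would typically produce a short candidate list that a re-ranker then rescores. The re-ranker is omitted here because the abstract gives no detail about its design.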