In this paper, we propose SimLM (Similarity matching with Language Model pre-training), a simple yet effective pre-training method for dense passage retrieval. It employs a simple bottleneck architecture that learns to compress passage information into a dense vector through self-supervised pre-training. We use a replaced language modeling objective, inspired by ELECTRA, to improve sample efficiency and reduce the mismatch in input distribution between pre-training and fine-tuning. SimLM requires only an unlabeled corpus, which makes it more broadly applicable when no labeled data or queries are available. We conduct experiments on several large-scale passage retrieval datasets and show substantial improvements over strong baselines under various settings. Remarkably, SimLM even outperforms multi-vector approaches such as ColBERTv2, which incur significantly higher storage costs. Our code and model checkpoints are available at https://github.com/microsoft/unilm/tree/master/simlm.
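To make the replaced language modeling idea concrete, the following is a toy sketch of the input corruption step, not the authors' implementation. It assumes that a subset of tokens is swapped for sampled alternatives and that the training labels are the original tokens. ELECTRA samples replacements from a small generator network; here uniform sampling from a vocabulary list stands in for that generator. The function name `corrupt_tokens` and the parameter `replace_prob` are hypothetical.

```python
import random


def corrupt_tokens(tokens, vocab, replace_prob=0.3, seed=0):
    """Swap a random subset of tokens for sampled vocabulary tokens.

    Returns (corrupted, labels, replaced), where `labels` holds the
    original tokens and `replaced` marks the positions that were sampled.
    Uniform sampling is an assumption standing in for a learned generator.
    """
    rng = random.Random(seed)
    corrupted, replaced = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # A sampled token may coincide with the original by chance.
            corrupted.append(rng.choice(vocab))
            replaced.append(True)
        else:
            corrupted.append(tok)
            replaced.append(False)
    return corrupted, list(tokens), replaced


if __name__ == "__main__":
    vocab = ["dense", "passage", "retrieval", "vector", "query", "model"]
    passage = ["dense", "passage", "retrieval", "with", "a", "vector"]
    corrupted, labels, replaced = corrupt_tokens(passage, vocab)
    print(corrupted)
    print(labels)
    print(replaced)
```

Because the corrupted input consists only of ordinary vocabulary tokens, with no special mask symbol, it resembles the text the encoder sees during fine-tuning, which is the distribution mismatch the abstract refers to.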