Pre-trained language models have been successful in many knowledge-intensive NLP tasks. However, recent work has shown that models such as BERT are not ``structurally ready'' to aggregate textual information into a [CLS] vector for dense passage retrieval (DPR). This ``lack of readiness'' results from the gap between language model pre-training and DPR fine-tuning. Previous solutions call for computationally expensive techniques such as hard negative mining, cross-encoder distillation, and further pre-training to learn a robust DPR model. In this work, we instead propose to fully exploit the knowledge in a pre-trained language model for DPR by aggregating the contextualized token embeddings into a dense vector, which we call agg*. By concatenating the vectors from the [CLS] token and agg*, our Aggretriever model substantially improves the effectiveness of dense retrieval models on both in-domain and zero-shot evaluations without introducing substantial training overhead. Code is available at https://github.com/castorini/dhr.
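To make the concatenation step concrete, the following is a minimal sketch, not the released implementation. It assumes element-wise max pooling as a stand-in for the aggregation that produces agg* (the abstract does not specify the actual operation) and dot-product scoring between query and passage vectors. The names `aggregate_tokens`, `build_representation`, and `dot_score` are hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]


def aggregate_tokens(token_embeddings: List[Vector]) -> Vector:
    """Pool contextualized token embeddings into one vector.

    ASSUMPTION: element-wise max pooling is a placeholder here; it is
    not claimed to be the aggregation that defines agg* in the paper.
    """
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    # zip(*rows) iterates over dimensions, so each column holds one
    # dimension's values across all tokens.
    return [max(column) for column in zip(*token_embeddings)]


def build_representation(cls_vector: Vector,
                         token_embeddings: List[Vector]) -> Vector:
    """Concatenate the [CLS] vector with the aggregated token vector."""
    return list(cls_vector) + aggregate_tokens(token_embeddings)


def dot_score(query_vec: Vector, passage_vec: Vector) -> float:
    """Relevance score as a dot product (assumed scoring function)."""
    if len(query_vec) != len(passage_vec):
        raise ValueError("query and passage vectors must have equal length")
    return sum(q * p for q, p in zip(query_vec, passage_vec))


# Toy example with 2-dimensional embeddings (made-up numbers).
query = build_representation([1.0, 0.0], [[0.0, 2.0], [1.0, 1.0]])
passage = build_representation([2.0, 1.0], [[3.0, 0.0]])
score = dot_score(query, passage)
```

In this sketch the final representation has twice the dimensionality of a single embedding, because the [CLS] part and the aggregated part are simply placed side by side. A real system would obtain the [CLS] vector and the token embeddings from the encoder's final layer.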