Aggretriever: A Simple Approach to Aggregate Textual Representations for Robust Dense Passage Retrieval

Pre-trained language models have been successful in many knowledge-intensive NLP tasks. However, recent work has shown that models such as BERT are not "structurally ready" to aggregate textual information into a [CLS] vector for dense passage retrieval (DPR). This "lack of readiness" results from the gap between language model pre-training and DPR fine-tuning. Previous solutions call for computationally expensive techniques such as hard negative mining, cross-encoder distillation, and further pre-training to learn a robust DPR model. In this work, we instead propose to fully exploit knowledge in a pre-trained language model for DPR by aggregating the contextualized token embeddings into a dense vector, which we call agg*. By concatenating vectors from the [CLS] token and agg*, our Aggretriever model substantially improves the effectiveness of dense retrieval models on both in-domain and zero-shot evaluations without introducing substantial training overhead. Code is available at https://github.com/castorini/dhr.
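The concatenation step described in the abstract can be illustrated with a small sketch. The abstract does not say how agg* is computed, so the element-wise max pooling below is only an assumed placeholder and not the paper's aggregation method. The function names (`max_pool`, `build_representation`, `dot`) and the toy vectors are hypothetical.

```python
from typing import List

Vector = List[float]


def max_pool(token_embeddings: List[Vector]) -> Vector:
    """Element-wise max over token embeddings.

    ASSUMPTION: a placeholder for agg*; the abstract does not define
    the actual aggregation function.
    """
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    return [max(dim_values) for dim_values in zip(*token_embeddings)]


def build_representation(cls_vec: Vector, token_embeddings: List[Vector]) -> Vector:
    """Concatenate the [CLS] vector with the aggregated token vector."""
    agg_vec = max_pool(token_embeddings)
    return cls_vec + agg_vec  # list concatenation: [CLS] part, then agg part


def dot(a: Vector, b: Vector) -> float:
    """Inner-product relevance score between two dense vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


# Toy inputs standing in for encoder outputs (hypothetical values).
query_vec = build_representation(
    [0.1, 0.3],
    [[0.2, -0.1, 0.5], [0.4, 0.0, -0.3]],
)
passage_vec = build_representation(
    [0.2, 0.1],
    [[0.3, 0.2, 0.1], [0.1, 0.6, 0.4]],
)
score = dot(query_vec, passage_vec)
print(len(query_vec), score)
```

In this sketch the final vector has the [CLS] dimensions followed by the aggregated dimensions, and queries and passages are scored with a single inner product, as is usual for dense retrieval.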

