Dense retrieval models typically represent a document with vectors taken from the last hidden layer of the document encoder, even though representations in different layers of a pre-trained language model are known to capture different kinds of linguistic knowledge and to behave differently during fine-tuning. We therefore propose to build the representation of a document from multiple encoder layers, which we denote Multi-layer Representations (MLR). We first investigate how representations from different layers affect MLR's performance under the multi-vector retrieval setting, and then leverage pooling strategies to reduce the multi-vector model to a single-vector one for better retrieval efficiency. Experiments demonstrate that MLR outperforms the dual encoder, ME-BERT, and ColBERT in the single-vector retrieval setting, and that it combines well with other advanced training techniques such as retrieval-oriented pre-training and hard negative mining.
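As a minimal illustration of the idea, the sketch below builds a document vector from several encoder layers and pools them into a single vector. The choice of `bert-base-uncased`, the specific layer indices, the use of the [CLS] token as each layer's document vector, and mean pooling are all assumptions for illustration; they are not the exact configuration described in this abstract.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumptions: a BERT-base document encoder, the [CLS] token as the per-layer
# document vector, and mean pooling over the selected layers. The exact layer
# subset and pooling strategy used by MLR are not specified in the abstract.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_document(text: str, layers=(4, 8, 12)) -> torch.Tensor:
    """Return a single document vector built from multiple encoder layers."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        outputs = encoder(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding layer; indices 1..12 are Transformer layers.
    per_layer_cls = [outputs.hidden_states[i][:, 0, :] for i in layers]  # each (1, hidden)
    # A multi-vector variant would keep per_layer_cls as separate vectors;
    # pooling (here: mean) reduces them to one vector for efficient retrieval.
    return torch.stack(per_layer_cls, dim=0).mean(dim=0).squeeze(0)

doc_vec = encode_document("Dense retrieval represents documents as vectors.")
print(doc_vec.shape)  # torch.Size([768])
```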