Large-scale retrieval aims to recall relevant documents from a huge collection given a query. It relies on representation learning to embed documents and queries into a common semantic encoding space. Depending on the encoding space, recent retrieval methods based on pre-trained language models (PLMs) can be coarsely categorized into dense-vector and lexicon-based paradigms. These two paradigms exploit the PLMs' representation capability at different granularities, i.e., global sequence-level compression and local word-level contexts, respectively. Inspired by their complementary global-local contextualization and distinct representing views, we propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability. Experiments on passage retrieval benchmarks verify its effectiveness in both paradigms. We further present a uni-retrieval scheme that achieves even better retrieval quality. Lastly, we evaluate the model on the BEIR benchmark to verify its transferability.
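To make the two paradigms concrete, the following is a minimal sketch of how a dense-vector score and a lexicon-based score can be computed for the same query-document pair and then merged. It is illustrative only: the function names, the toy vectors and term weights, and the linear interpolation with weight `alpha` are all assumptions made for this example and are not taken from UnifieR, whose actual uni-retrieval scheme is not specified in this abstract.

```python
from typing import Dict, List, Tuple

Dense = List[float]        # sequence-level embedding (global view)
Lexicon = Dict[str, float]  # sparse term weights over a vocabulary (local view)


def dense_score(q: Dense, d: Dense) -> float:
    """Dot product of two dense sequence-level vectors."""
    return sum(a * b for a, b in zip(q, d))


def lexicon_score(q: Lexicon, d: Lexicon) -> float:
    """Dot product of two sparse term-weight maps (only shared terms contribute)."""
    return sum(w * d[t] for t, w in q.items() if t in d)


def rank(q_dense: Dense, q_lex: Lexicon,
         docs: Dict[str, Tuple[Dense, Lexicon]],
         alpha: float = 0.5) -> List[str]:
    """Rank document ids by an interpolated score.

    ASSUMPTION: linear interpolation is a generic hybrid heuristic used here
    for illustration; it is not claimed to be the paper's scheme.
    alpha = 1.0 gives dense-only ranking, alpha = 0.0 gives lexicon-only.
    """
    scored = [
        (alpha * dense_score(q_dense, dv) + (1.0 - alpha) * lexicon_score(q_lex, lv), doc_id)
        for doc_id, (dv, lv) in docs.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, key=lambda x: x[0], reverse=True)]


# Hypothetical toy data: three documents, each with both representations.
QUERY_DENSE = [1.0, 0.0]
QUERY_LEX = {"retrieval": 1.0}
DOCS = {
    "d1": ([0.9, 0.0], {"cooking": 1.0}),    # close in dense space, no term overlap
    "d2": ([0.0, 1.0], {"retrieval": 1.0}),  # exact term match, far in dense space
    "d3": ([0.5, 0.5], {"retrieval": 0.8}),  # moderately good under both views
}

if __name__ == "__main__":
    print("dense only  :", rank(QUERY_DENSE, QUERY_LEX, DOCS, alpha=1.0))
    print("lexicon only:", rank(QUERY_DENSE, QUERY_LEX, DOCS, alpha=0.0))
    print("combined    :", rank(QUERY_DENSE, QUERY_LEX, DOCS, alpha=0.5))
```

In this toy setting each single-view ranking prefers a different document, while the combined score favors the document that is reasonably good under both views, which illustrates the kind of complementarity the abstract describes.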