Generative Retrieval (GR), which autoregressively decodes the identifiers of relevant documents given a query, has been shown to perform well on small-scale corpora. By memorizing the document corpus in its model parameters, GR implicitly achieves deep interaction between query and document. However, this memorization mechanism has three drawbacks: (1) poor memory accuracy for fine-grained document features; (2) memory confusion that worsens as the corpus size increases; and (3) high memory-update costs for new documents. To alleviate these problems, we propose the Generative Dense Retrieval (GDR) paradigm. Specifically, GDR first uses the limited memory volume to perform inter-cluster matching from a query to relevant document clusters. The memorization-free matching mechanism of Dense Retrieval (DR) is then introduced to perform fine-grained intra-cluster matching from clusters to relevant documents. This coarse-to-fine process maximizes the advantages of GR's deep interaction and DR's scalability. In addition, we design a cluster identifier construction strategy to facilitate corpus memorization and a cluster-adaptive negative sampling strategy to enhance the intra-cluster mapping ability. Empirical results show that GDR obtains an average improvement of 3.0 R@100 on the NQ dataset under multiple settings and offers better scalability.
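The coarse-to-fine process described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `coarse_to_fine_retrieve`, its parameters, and the toy data are hypothetical. The cluster scores are passed in as a plain dictionary standing in for the likelihoods a generative model would assign when decoding cluster identifiers, and the intra-cluster stage is assumed to be a simple dot product between query and document embeddings.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def coarse_to_fine_retrieve(
    query_vec: Vector,
    cluster_scores: Dict[str, float],
    cluster_docs: Dict[str, Dict[str, Vector]],
    top_clusters: int = 2,
    top_docs: int = 3,
) -> List[Tuple[str, float]]:
    """Hypothetical two-stage retrieval: clusters first, then documents."""
    # Stage 1 (inter-cluster): keep the highest-scoring clusters.
    # Assumption: cluster_scores stands in for the scores a generative
    # model would produce when decoding cluster identifiers.
    ranked_clusters = sorted(
        cluster_scores, key=cluster_scores.get, reverse=True
    )[:top_clusters]

    # Stage 2 (intra-cluster): score only documents inside the kept
    # clusters with a dense (dot-product) similarity.
    candidates: List[Tuple[str, float]] = []
    for cluster_id in ranked_clusters:
        for doc_id, doc_vec in cluster_docs[cluster_id].items():
            candidates.append((doc_id, dot(query_vec, doc_vec)))

    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_docs]


# Toy data (made up for illustration only).
cluster_scores = {"c0": 0.7, "c1": 0.2, "c2": 0.1}
cluster_docs = {
    "c0": {"d0": [1.0, 0.0], "d1": [0.5, 0.5]},
    "c1": {"d2": [0.0, 1.0]},
    "c2": {"d3": [2.0, 2.0]},
}
query = [1.0, 0.0]

results = coarse_to_fine_retrieve(
    query, cluster_scores, cluster_docs, top_clusters=2, top_docs=2
)
print(results)
```

In this toy setup, document `d3` has the largest dot product with the query but is never scored, because its cluster is not selected in the first stage. The sketch thus shows why the quality of the inter-cluster stage bounds the recall of the whole pipeline, and why new documents could in principle be added to `cluster_docs` without retraining the cluster scorer.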