Dense retrieval models, which embed queries and documents into a shared low-dimensional space, have gained widespread popularity, but they have been shown to exhibit important theoretical limitations and to lag considerably behind traditional sparse retrieval models in certain settings. Generative retrieval has emerged as an alternative to dense retrieval that uses a language model to predict query-document relevance directly. In this paper, we demonstrate the strengths and weaknesses of generative retrieval approaches using LIMIT, a simple synthetic dataset that was previously introduced to empirically demonstrate the theoretical limitations of embedding-based retrieval but has not been used to evaluate generative retrieval. We close this research gap and show that generative retrieval achieves the best performance on this dataset without any additional training (Recall@2, abbreviated R@2, of 0.92 and 0.99 for SEAL and MINDER, respectively), compared with dense approaches (< 0.03 R@2) and BM25 (0.86 R@2). However, when we extend the original LIMIT dataset with simple hard negative samples, performance degrades for all models, including the generative retrieval models (0.51 R@2) and BM25 (0.21 R@2). Error analysis identifies a failure in the decoding mechanism, caused by the inability to produce identifiers that are unique to relevant documents. Future generative retrieval approaches must address these issues, either by designing identifiers that are better suited to the decoding process or by adapting decoding and scoring algorithms to preserve relevance signals.
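For readers unfamiliar with the metric reported above, the following is a minimal sketch of how Recall@k is commonly computed: the fraction of a query's relevant documents that appear among the top k retrieved results, averaged over queries. The function names, document identifiers, and toy rankings are hypothetical and chosen only for illustration. They are not taken from the paper, and the paper's own evaluation code may differ.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    # Deduplicate the top-k slice so a repeated ID is not counted twice.
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int
) -> float:
    """Average Recall@k over every query that has relevance judgments."""
    # A query missing from `runs` is scored as an empty ranking (recall 0).
    scores = [recall_at_k(runs.get(qid, []), rel, k) for qid, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy, made-up data (hypothetical IDs): two queries, two relevant documents each.
qrels = {"q1": {"d1", "d2"}, "q2": {"d3", "d4"}}
runs = {"q1": ["d1", "d2", "d9"], "q2": ["d3", "d7", "d4"]}

print(mean_recall_at_k(runs, qrels, k=2))  # q1 -> 1.0, q2 -> 0.5, mean -> 0.75
```

In this toy run, the second query ranks an irrelevant document (`d7`) above a relevant one (`d4`), so only one of its two relevant documents lands in the top 2. This is the kind of displacement a hard negative can cause.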