Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders

Cross-encoder (CE) models, which compute similarity by jointly encoding a query-item pair, estimate query-item relevance more accurately than embedding-based models (dual-encoders). Existing approaches perform k-NN search with a CE by approximating the CE similarity with a vector embedding space fit either with dual-encoders (DE) or with CUR matrix factorization. DE-based retrieve-and-rerank approaches suffer from poor recall on new domains, and retrieval with the DE is decoupled from the CE. While CUR-based approaches can be more accurate than DE-based approaches, they require a prohibitively large number of CE calls to compute item embeddings, making them impractical for deployment at scale. In this paper, we address these shortcomings with a sparse-matrix-factorization-based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity. We compute item embeddings offline by factorizing a sparse matrix containing query-item CE scores for a set of train queries. Our method produces a high-quality approximation while requiring only a fraction of the CE calls needed by CUR-based methods, and it can leverage a DE to initialize the embedding space while avoiding compute- and resource-intensive finetuning of the DE via distillation. At test time, the item embeddings remain fixed and retrieval proceeds over rounds, alternating between (a) estimating the test query embedding by minimizing the error in approximating the CE scores of the items retrieved so far, and (b) using the updated test query embedding to retrieve more items. Our k-NN search method improves recall by up to 5% (k=1) and 54% (k=100) over DE-based approaches. Additionally, our indexing approach achieves a speedup of up to 100x over CUR-based methods and 5x over DE distillation methods, while matching or improving k-NN search recall over baselines.
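The test-time procedure described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `adaptive_retrieve`, the `ce_score` callable standing in for the cross-encoder, the use of ordinary least squares as the error-minimization step, the equal-sized rounds, and the use of `init_scores` (for example, DE scores) to pick the first batch are all assumptions made for illustration.

```python
import numpy as np


def adaptive_retrieve(ce_score, item_emb, init_scores, budget, rounds):
    """Retrieve `budget` items over `rounds` rounds with fixed item embeddings.

    ce_score    -- hypothetical callable: array of item indices -> exact CE scores
    item_emb    -- (n_items, d) item embeddings computed offline
    init_scores -- (n_items,) scores used only to choose the first batch
    """
    per_round = budget // rounds

    # Round 1: no query embedding exists yet, so fall back to the initial scores.
    retrieved = np.argsort(-init_scores)[:per_round]
    scores = ce_score(retrieved)

    for _ in range(rounds - 1):
        # (a) Estimate the query embedding by least squares on the CE scores
        #     of the items retrieved so far (assumed choice of error measure).
        q, *_ = np.linalg.lstsq(item_emb[retrieved], scores, rcond=None)

        # (b) Score every item with the approximation and retrieve the best
        #     items that have not been retrieved yet.
        approx = item_emb @ q
        approx[retrieved] = -np.inf
        new = np.argsort(-approx)[:per_round]

        retrieved = np.concatenate([retrieved, new])
        scores = np.concatenate([scores, ce_score(new)])

    return retrieved, scores
```

The final k nearest neighbors would then be the k retrieved items with the highest exact CE scores, for example `retrieved[np.argsort(-scores)[:k]]`. In this sketch the total number of CE calls equals `per_round * rounds`, so the budget directly controls test-time cost.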
