Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval

Approximate Nearest Neighbour (ANN) search indices form the backbone of real-world recommender systems, enabling real-time candidate retrieval over million-item catalogues. Typically, a single point-estimate embedding is learnt for every user and every item. At serving time, the user embedding queries the index for relevant items. Since these representations are learnt from sparse interaction data, they are noisy and may fail to capture all the nuances that contribute to ``relevance''; point estimates ignore this inherent uncertainty. The result is a retrieval pipeline that is systematically biased toward the small minority of popular head items with well-estimated embeddings, at the expense of the long-tail majority of niche, diverse, and serendipitous content. We propose DINOSAUR (Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval): a simple and infrastructure-compatible framework to incorporate embedding uncertainty into candidate generation. Rather than indexing point estimates, DINOSAUR samples $S_i$ embeddings for each item $i$ and constructs an index on this augmented set. Analogously, a user embedding is sampled at query time. This two-sided stochastic retrieval process implicitly marginalises over embedding uncertainty, without requiring changes to model architecture or ANN index infrastructure. On the analytical side, we show that DINOSAUR recovers standard point-estimate retrieval as uncertainty vanishes, and we characterise how increased embedding variance expands the regions of latent space in which uncertain items are retrievable. Reproducible empirical observations align with these expectations, showing large coverage gains with small losses in offline recall.
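To make the two-sided sampling procedure concrete, the following is a minimal sketch rather than the reference implementation. Several details are assumptions not stated above: embeddings are modelled as diagonal Gaussians with a per-dimension mean and standard deviation, similarity is the inner product, an exhaustive scan stands in for a real ANN index, and retrieved samples are deduplicated back to item identifiers. All function names are hypothetical.

```python
import random


def sample_embedding(mean, std, rng):
    """Draw one vector from an ASSUMED diagonal Gaussian N(mean, diag(std^2))."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def build_augmented_index(items, num_samples, rng):
    """Index S_i sampled embeddings per item instead of one point estimate.

    items:       {item_id: (mean, std)}
    num_samples: {item_id: S_i}
    Returns a flat list of (item_id, vector) entries; a production system
    would load these vectors into an ANN index instead (assumption).
    """
    index = []
    for item_id, (mean, std) in items.items():
        for _ in range(num_samples[item_id]):
            index.append((item_id, sample_embedding(mean, std, rng)))
    return index


def dot(u, v):
    """Inner-product similarity (assumed scoring function)."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(index, user_mean, user_std, k, rng):
    """Sample a user embedding, then return the top-k distinct item ids."""
    query = sample_embedding(user_mean, user_std, rng)
    # Exhaustive scan as a stand-in for the ANN lookup.
    ranked = sorted(index, key=lambda entry: dot(query, entry[1]), reverse=True)
    result = []
    for item_id, _ in ranked:
        if item_id not in result:  # collapse multiple samples of one item
            result.append(item_id)
        if len(result) == k:
            break
    return result


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy catalogue: "c" is a high-uncertainty item.
    items = {
        "a": ([1.0, 0.0], [0.05, 0.05]),
        "b": ([0.0, 1.0], [0.05, 0.05]),
        "c": ([0.5, 0.5], [0.60, 0.60]),
    }
    index = build_augmented_index(items, {"a": 1, "b": 1, "c": 4}, rng)
    print(retrieve(index, [1.0, 0.2], [0.1, 0.1], k=2, rng=rng))
```

When every standard deviation is set to zero, each sample equals its mean, so this sketch reduces to ordinary point-estimate top-$k$ retrieval, mirroring the vanishing-uncertainty limit described above.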
