Random-Access Ranked Retrieval and Similarity Search

We extend Random Access, a fundamental operation that enables efficient search and exploration algorithms, to modern interactive data systems based on Ranked Retrieval and Similarity Search, where orderings are dynamically defined over a high-dimensional feature space. This extension enables efficient solutions for a wide range of applications, from data analytics tools and database systems to recommendation systems and machine learning. We formalize the Random-Access Ranked Retrieval (RAR) problem and extend it to Similarity Search. We first develop a theoretically efficient algorithm based on geometric arrangements that achieves logarithmic query time. However, this method suffers from exponential space complexity in high dimensions. We therefore develop a second class of algorithms based on $\varepsilon$-sampling, which use linear space. Because exactly locating the tuple at a specific rank is challenging due to its connection to the range counting problem, we introduce a relaxed variant called $\kappa$-Random-Access Ranked Retrieval, which returns a small subset of size $\kappa$ guaranteed to contain the target tuple. To solve this problem efficiently, we define an intermediate problem, Stripe Range Retrieval (SRR), and design a hierarchical sampling data structure tailored for narrow stripe range queries. Our method achieves practical scalability in both data size and dimensionality. We prove near-optimal bounds on the efficiency of our algorithms and validate their performance through extensive experiments on real and synthetic datasets, demonstrating scalability to millions of tuples and hundreds of dimensions.
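To make the problem statement concrete, the following is a minimal brute-force sketch of RAR and its $\kappa$-relaxation. It is not the arrangement-based or sampling-based algorithm described above. It assumes a linear scoring function given by a weight vector, descending score order, and 1-indexed ranks; the function names (`score`, `rar_bruteforce`, `kappa_rar_bruteforce`) are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def score(point: Sequence[float], weights: Sequence[float]) -> float:
    """Linear score of a tuple (assumption: ranking is by a weighted sum)."""
    return sum(w * x for w, x in zip(weights, point))


def rar_bruteforce(data: List[Point], weights: Sequence[float], k: int) -> Point:
    """Return the tuple at 1-indexed rank k, ordering scores from high to low.

    Baseline only: it sorts the whole dataset on every query, which costs
    O(n log n) time per query.
    """
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and len(data)")
    ranked = sorted(data, key=lambda p: score(p, weights), reverse=True)
    return ranked[k - 1]


def kappa_rar_bruteforce(
    data: List[Point], weights: Sequence[float], k: int, kappa: int
) -> List[Point]:
    """Return kappa tuples that are guaranteed to include the rank-k tuple.

    Here the subset is a window of consecutive ranks around k; this is one
    simple way to satisfy the relaxed guarantee, used purely for illustration.
    """
    n = len(data)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(data)")
    if not 1 <= kappa <= n:
        raise ValueError("kappa must be between 1 and len(data)")
    ranked = sorted(data, key=lambda p: score(p, weights), reverse=True)
    # Centre the window on rank k, then clamp it to the valid index range.
    start = max(0, min(k - 1 - kappa // 2, n - kappa))
    return ranked[start:start + kappa]


if __name__ == "__main__":
    data = [(1.0, 2.0), (3.0, 1.0), (2.0, 2.0), (0.0, 5.0)]
    weights = (2.0, 1.0)
    print(rar_bruteforce(data, weights, 2))
    print(kappa_rar_bruteforce(data, weights, 2, 2))
```

The sketch only fixes the input and output contract of the two problems. Because it re-sorts the data for each new weight vector, it serves as a correctness reference rather than as an efficient solution.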
