Group Testing for Accurate and Efficient Range-Based Near Neighbor Search: An Adaptive Binary Splitting Approach

This work presents an adaptive group testing framework for the range-based high dimensional near neighbor search problem. The proposed method detects high-similarity vectors in a large collection of high dimensional vectors, where each vector represents an image descriptor. Our method efficiently marks each item in the collection as a neighbor or non-neighbor on the basis of a cosine distance threshold, without exhaustive search. Like other methods for large scale retrieval, our approach exploits the assumption that most items in the collection are unrelated to the query. Unlike other methods, it does not assume a large gap between the query's cosine similarity to its least similar neighbor and its cosine similarity to its most similar non-neighbor. Following binary splitting, a multi-stage adaptive group testing algorithm, we split the set of items to be searched in half at each step and perform dot product tests on smaller and smaller subsets, many of which we are able to prune away. We show experimentally, on a variety of large datasets, that our method is more than ten times faster than exhaustive search while achieving the same accuracy as exhaustive search. We present a theoretical analysis of the expected number of distance computations per query and of the probability that a pool with a given number of members will be pruned. In this way, unlike other methods, ours exploits useful and practical distributional properties of the data. All data structures required by our method are created purely offline. Moreover, our method does not impose any strong assumptions on the number of true near neighbors, is adaptable to streaming settings where new vectors are dynamically added to the database, and does not require any parameter tuning.
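The splitting-and-pruning procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on assumptions the abstract does not state: all database vectors and the query are taken to be non-negative, so the dot product of the query with the sum of a pool upper-bounds the dot product with every member of that pool; pools are split at the midpoint; and pool sums are stored in a dictionary keyed by index range. The names `build_pool_sums` and `range_search` are hypothetical. For unit-norm vectors, the dot product equals the cosine similarity, so `threshold` plays the role of a similarity threshold.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def build_pool_sums(vectors):
    """Offline step: precompute the sum vector of every pool that
    midpoint binary splitting can visit, keyed by (lo, hi) index range."""
    sums = {}

    def build(lo, hi):
        if hi - lo == 1:
            sums[(lo, hi)] = list(vectors[lo])
        else:
            mid = (lo + hi) // 2
            left, right = build(lo, mid), build(mid, hi)
            sums[(lo, hi)] = [a + b for a, b in zip(left, right)]
        return sums[(lo, hi)]

    if vectors:
        build(0, len(vectors))
    return sums


def range_search(query, sums, n, threshold):
    """Query step: return (indices with dot(query, v) >= threshold,
    number of dot product tests performed).

    Assumes non-negative data and query, so a pool whose summed vector
    scores below the threshold cannot contain a neighbor."""
    neighbors, tests = [], 0
    stack = [(0, n)] if n else []
    while stack:
        lo, hi = stack.pop()
        tests += 1
        if dot(query, sums[(lo, hi)]) < threshold:
            continue  # prune: no member of this pool can reach the threshold
        if hi - lo == 1:
            neighbors.append(lo)  # a singleton pool that passes is a neighbor
        else:
            mid = (lo + hi) // 2
            stack.append((mid, hi))
            stack.append((lo, mid))  # left half is processed first
    return neighbors, tests
```

Under the non-negativity assumption, this sketch returns the same set as exhaustive search, because a pool is discarded only when its summed score is already below the threshold. The savings come from pools pruned near the top of the tree, which is most likely when few items are related to the query.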
