Locality-sensitive hashing (LSH) is an effective randomized technique widely used in many machine learning tasks. The cost of hashing is proportional to the data dimensionality, and is thus often the performance bottleneck when the dimensionality is high and the number of hash functions involved is large. Surprisingly, however, little work has been done to improve the efficiency of LSH computation. In this paper, we design a simple yet efficient LSH scheme, named FastLSH, under the l2 norm. By combining random sampling and random projection, FastLSH reduces the time complexity of hash computation from O(n) to O(m) (m < n), where n is the data dimensionality and m is the number of sampled dimensions. Moreover, FastLSH has a provable LSH property, which distinguishes it from non-LSH fast sketches. We conduct comprehensive experiments over a collection of real and synthetic datasets for the nearest neighbor search task. Experimental results demonstrate that FastLSH is on par with state-of-the-art methods in terms of answer quality, space usage, and query efficiency, while enjoying up to 80x speedup in hash function evaluation. We believe that FastLSH is a promising alternative to the classic LSH scheme.
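The abstract does not spell out the construction, so the following is only a minimal sketch of the general idea of combining random sampling with random projection. It assumes the classic l2 hash form floor((a·x + b) / w), with Gaussian a and uniform b, applied to m randomly sampled coordinates rather than all n. The names `make_fastlsh`, `w`, and `seed`, and the choice to sample with replacement, are illustrative assumptions and may differ from the paper's actual definition.

```python
import math
import random


def make_fastlsh(n, m, w, seed=0):
    """Build one illustrative hash function for n-dimensional vectors.

    Assumption: m coordinates are sampled uniformly with replacement,
    then projected with i.i.d. standard Gaussian weights and quantized
    into buckets of width w. This is a sketch, not the paper's code.
    """
    rng = random.Random(seed)
    idx = [rng.randrange(n) for _ in range(m)]    # sampled dimensions
    a = [rng.gauss(0.0, 1.0) for _ in range(m)]   # projection weights
    b = rng.uniform(0.0, w)                       # random bucket offset

    def h(x):
        # Only the m sampled coordinates are read, so one evaluation
        # costs O(m) instead of O(n).
        proj = sum(a_j * x[i] for a_j, i in zip(a, idx))
        return math.floor((proj + b) / w)

    return h


if __name__ == "__main__":
    h = make_fastlsh(n=1000, m=30, w=4.0, seed=42)
    x = [0.5] * 1000
    print(h(x))
```

Each call to `h` touches only the sampled coordinates, which is where the O(m) evaluation cost comes from. In a nearest neighbor index, many such functions with different seeds would typically be concatenated into hash table keys.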