Computing data similarity (or distance) is a fundamental research topic that underpins a variety of similarity-based machine learning and data mining applications. In big data analytics, computing the exact similarity between data instances is impractical because of the high computational cost. To this end, the Locality Sensitive Hashing (LSH) technique has been proposed to provide accurate estimators for various similarity measures between sets or vectors efficiently and without a learning process. Structured data (e.g., sequences, trees, and graphs), which are composed of elements and the relations between them, are common in the real world, but traditional LSH algorithms cannot preserve the structure information represented as relations between elements. To address this issue, researchers have focused on the family of hierarchical LSH algorithms. In this paper, we survey the current progress of research on hierarchical LSH from the following perspectives: 1) Data structures, where we review various hierarchical LSH algorithms for three typical data structures and uncover their inherent connections; 2) Applications, where we review hierarchical LSH algorithms in multiple application scenarios; 3) Challenges, where we discuss some potential challenges as future directions.
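To make the idea of a learning-free similarity estimator concrete, the following is a minimal sketch of MinHash, a classic LSH scheme for Jaccard similarity between sets. It illustrates traditional (flat) LSH only, not any hierarchical algorithm covered by this paper. The function names, the seeded BLAKE2b hash, and the choice of 256 hash functions are illustrative assumptions.

```python
import hashlib


def _hash(seed: int, element: str) -> int:
    """Seeded 64-bit hash of an element (assumed stand-in for a random permutation)."""
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items: set, num_hashes: int = 256) -> list:
    """Return the MinHash signature of a non-empty set of strings.

    For each seeded hash function, keep the minimum hash value over the set.
    """
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|, for comparison."""
    return len(a & b) / len(a | b)


# Hypothetical example sets sharing 3 of 9 distinct elements.
set_a = {"a", "b", "c", "d", "e", "f"}
set_b = {"d", "e", "f", "g", "h", "i"}

estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(f"exact={exact_jaccard(set_a, set_b):.3f} estimate={estimate:.3f}")
```

Each signature has a fixed length regardless of set size, so comparing two signatures costs time proportional to the number of hash functions rather than to the size of the sets. The sketch treats every set as an unordered bag of elements, which is exactly the limitation for structured data that hierarchical LSH aims to overcome.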