Data similarity (or distance) computation is a fundamental research topic that underpins a variety of similarity-based machine learning and data mining applications. In big data analytics, computing the exact similarity of data instances is impractical because of its high computational cost. To this end, the Locality Sensitive Hashing (LSH) technique has been proposed to provide accurate estimators for various similarity measures between sets or vectors efficiently and without a learning process. Structured data (e.g., sequences, trees and graphs), which are composed of elements and the relations between them, are common in the real world, but traditional LSH algorithms cannot preserve the structure information represented by those relations. To address this issue, researchers have devoted their efforts to the family of hierarchical LSH algorithms. In this paper, we explore the current progress of research on hierarchical LSH from the following perspectives: 1) Data structures, where we review various hierarchical LSH algorithms for three typical data structures and uncover their inherent connections; 2) Applications, where we review hierarchical LSH algorithms in multiple application scenarios; 3) Challenges, where we discuss some potential challenges as future directions.
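To make the idea of a learning-free similarity estimator concrete, the following is a minimal sketch of MinHash, a classic LSH scheme that estimates the Jaccard similarity between two sets. It illustrates only the traditional (flat) setting mentioned above, not any of the hierarchical algorithms surveyed in the paper. The function names, the use of seeded SHA-1 as a stand-in for random permutations, and the choice of 256 hash functions are all assumptions made for illustration.

```python
import hashlib


def seeded_hash(seed: int, item: str) -> int:
    """Map (seed, item) to a 64-bit integer; each seed acts as one hash function."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: set, num_hashes: int = 256) -> list:
    """Keep the minimum hash value of a non-empty set under each hash function."""
    return [min(seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of positions where two signatures agree estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, used here only to check the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    set_a = {"a", "b", "c", "d", "e", "f"}
    set_b = {"d", "e", "f", "g", "h", "i"}
    sig_a = minhash_signature(set_a)
    sig_b = minhash_signature(set_b)
    print("exact:    ", exact_jaccard(set_a, set_b))
    print("estimated:", estimate_jaccard(sig_a, sig_b))
```

The signatures have a fixed length regardless of set size, so comparing two instances costs time proportional to the number of hash functions rather than to the size of the data. Because each element is hashed independently, the relations between elements play no role, which is the limitation on structured data that the abstract points out.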