A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of keys is a function that maps each key in S to its rank. On keys not in S, the function returns an arbitrary value. Applications include databases, search engines, data encryption, and pattern-matching algorithms. In this paper, we describe LeMonHash, a new technique for constructing MMPHFs for integers. The core idea of LeMonHash is surprisingly simple and effective: we learn a monotone mapping from keys to their rank via an error-bounded piecewise linear model (the PGM-index), and then we resolve the collisions that may arise among keys mapping to the same rank estimate by associating small integers with them in a retrieval data structure (BuRR). On synthetic random datasets, LeMonHash needs 34% less space than the next larger competitor, while achieving about 16 times faster queries. On real-world datasets, the space usage is very close to or much better than that of the best competitors, while queries are up to 19 times faster than those of the next larger competitor. LeMonHash construction also improves by a factor of up to 2 over the competitor with the next best space usage. We also investigate the case of keys being variable-length strings, introducing LeMonHash-VL: it needs space within 13% of the best competitors while achieving up to 3 times faster queries than the next larger competitor.
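The two-step idea in the abstract (a monotone rank estimate, then a small stored integer to separate keys that share an estimate) can be illustrated with a toy sketch. This is not the LeMonHash implementation: it assumes a single linear segment in place of the error-bounded piecewise linear model (PGM-index), a plain Python dict in place of the BuRR retrieval structure, and an uncompressed list of bucket offsets. The class name `ToyMonotoneHash` and its methods are hypothetical names chosen for illustration.

```python
class ToyMonotoneHash:
    """Toy monotone minimal perfect hash for a set of integers.

    Assumptions (not from the paper): one linear segment stands in for
    the PGM-index, and a dict stands in for the BuRR retrieval structure.
    """

    def __init__(self, keys):
        keys = sorted(set(keys))
        if not keys:
            raise ValueError("key set must be non-empty")
        self.n = len(keys)
        self.lo = keys[0]
        # Avoid division by zero when the set has a single key.
        self.span = max(keys[-1] - keys[0], 1)

        # Count how many keys the monotone mapping sends to each bucket.
        counts = [0] * self.n
        for k in keys:
            counts[self._bucket(k)] += 1

        # offsets[b] = number of keys in buckets strictly before b.
        self.offsets = [0] * self.n
        running = 0
        for b in range(self.n):
            self.offsets[b] = running
            running += counts[b]

        # For buckets holding more than one key, store each key's local
        # rank inside its bucket. Singleton buckets need no stored value.
        self.local = {}
        for rank, k in enumerate(keys):
            b = self._bucket(k)
            if counts[b] > 1:
                self.local[k] = rank - self.offsets[b]

    def _bucket(self, key):
        # Monotone (non-decreasing) linear map from keys to [0, n - 1].
        b = (key - self.lo) * (self.n - 1) // self.span
        # Clamp so that keys outside the set still yield some bucket.
        return min(max(b, 0), self.n - 1)

    def rank(self, key):
        # Global rank = keys in earlier buckets + local rank in bucket.
        # For keys not in the set, the result is an arbitrary value.
        return self.offsets[self._bucket(key)] + self.local.get(key, 0)


keys = [3, 8, 9, 10, 50, 51, 1000]
mmphf = ToyMonotoneHash(keys)
ranks = [mmphf.rank(k) for k in keys]
print(ranks)
```

In this sketch the stored local ranks are bounded by the bucket size, so a better rank estimate means smaller buckets and fewer bits per key. A single linear segment fits skewed data poorly (here six of the seven keys fall into one bucket), which is the motivation for using an error-bounded piecewise model instead.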