Minimal perfect hashing is the problem of mapping a static set of $n$ distinct keys bijectively into the address space $\{1,\ldots,n\}$. It is well known that $n\log_2(e)$ bits are necessary to specify a minimal perfect hash function (MPHF) $f$ when no additional knowledge of the input keys is used. In practice, however, the input keys often have intrinsic relationships that can be exploited to lower the bit complexity of $f$. For example, consider a string and the set of all its distinct $k$-mers as input keys: since two consecutive $k$-mers share an overlap of $k-1$ symbols, it seems possible to beat the classic $\log_2(e)$ bits/key barrier in this case. Moreover, we would like $f$ to map consecutive $k$-mers to consecutive addresses, so as to preserve their relationship in the codomain as much as possible. This feature is useful because it guarantees a certain degree of locality of reference for $f$, resulting in a better evaluation time when querying consecutive $k$-mers. Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF designed for $k$-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases as $k$ grows and discuss experiments with a practical implementation of the method: the functions built with our method can be several times smaller than the most efficient MPHFs in the literature, and even faster to query.
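To make the notions of overlap and locality preservation concrete, the following is a minimal Python sketch and not the construction studied in this work. It assumes a toy bijection that ranks each distinct $k$-mer by its first occurrence in the string, stored in a plain dictionary (which uses far more than $\log_2(e)$ bits/key). The helper names `kmers`, `first_occurrence_ranks`, and `consecutive_fraction` are hypothetical and chosen only for illustration.

```python
import math


def kmers(s, k):
    """Yield the k-mers of s in the order in which they occur."""
    for i in range(len(s) - k + 1):
        yield s[i:i + k]


def first_occurrence_ranks(s, k):
    """Toy bijection (an assumption for illustration): map each distinct
    k-mer of s to an address in {1, ..., n} by order of first occurrence."""
    ranks = {}
    for km in kmers(s, k):
        if km not in ranks:
            ranks[km] = len(ranks) + 1
    return ranks


def consecutive_fraction(s, k, ranks):
    """Fraction of consecutive k-mer pairs mapped to consecutive addresses."""
    kms = list(kmers(s, k))
    pairs = list(zip(kms, kms[1:]))
    if not pairs:
        return 1.0
    hits = sum(1 for a, b in pairs if ranks[b] == ranks[a] + 1)
    return hits / len(pairs)


if __name__ == "__main__":
    s, k = "ACGTACGGA", 3
    ranks = first_occurrence_ranks(s, k)
    n = len(ranks)
    # Classic lower bound for an MPHF that ignores key structure.
    print(f"n = {n}, lower bound = {n * math.log2(math.e):.2f} bits")
    print(f"locality = {consecutive_fraction(s, k, ranks):.2f}")
```

In this toy setting, locality is lost only where the string revisits a $k$-mer it has already seen, which hints at why the overlap structure of consecutive $k$-mers is worth exploiting.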