PHOBIC: Perfect Hashing with Optimized Bucket Sizes and Interleaved Coding

A minimal perfect hash function (MPHF) maps a set of n keys to {1, ..., n} without collisions. Such functions are widely used, for example in bioinformatics and databases. In this paper we revisit PTHash, a construction technique designed specifically for fast queries. PTHash distributes the input keys into small buckets and, for each bucket, searches for a hash function seed that places the bucket's keys in the output domain without collisions. The collection of all seeds is then stored in compressed form. Since the first buckets are easier to place, while the output domain is still mostly empty, buckets are considered in non-increasing order of size. Additionally, PTHash heuristically produces an imbalanced distribution of bucket sizes by distributing 60% of the keys into 30% of the buckets. Our main contribution is to characterize, up to lower-order terms, an optimal distribution of expected bucket sizes. We arrive at a simple, closed-form solution which improves construction throughput for space-efficient configurations in practice. Our second contribution is a novel encoding scheme for the seeds. We split the keys into partitions. Within each partition, we run the bucket distribution and search step. We then store the seeds in an interleaved way by consecutively placing the seeds for the i-th buckets from all partitions. The seeds for the i-th bucket of each partition follow the same statistical distribution. This allows us to tune a separate compressor for each bucket index. Hence, we call our technique PHOBIC: Perfect Hashing with Optimized Bucket sizes and Interleaved Coding. Compared to PTHash, PHOBIC is 0.17 bits/key more space-efficient for the same query time and construction throughput. We also contribute a GPU implementation to further accelerate MPHF construction. For a configuration with fast queries, PHOBIC-GPU can construct a perfect hash function at 2.17 bits/key in 28 ns per key, which can be queried in 37 ns on the CPU.
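The bucket placement and interleaved layout described above can be illustrated with a minimal Python sketch. It is not the PTHash or PHOBIC implementation: the helper names (`h`, `build`, `query`, `interleave`, `seed_at`) are hypothetical, BLAKE2b stands in for a fast hash function, buckets are assigned uniformly rather than with the optimized size distribution, seeds are left uncompressed, and the output range is {0, ..., n-1} instead of {1, ..., n}. Keys are assumed to be distinct.

```python
import hashlib


def h(key: str, seed: int) -> int:
    """64-bit hash of `key` under `seed` (BLAKE2b is a stand-in, an assumption)."""
    data = f"{seed}:{key}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def build(keys, avg_bucket_size=4):
    """Return one seed per bucket such that all keys land in distinct slots."""
    n = len(keys)
    num_buckets = max(1, n // avg_bucket_size)
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        # Uniform bucket assignment; the paper skews expected bucket sizes instead.
        buckets[h(k, 0) % num_buckets].append(k)

    # Largest buckets first: they are placed while most slots are still free.
    order = sorted(range(num_buckets), key=lambda b: -len(buckets[b]))
    taken = [False] * n
    seeds = [0] * num_buckets
    for b in order:
        if not buckets[b]:
            continue
        seed = 1
        while True:
            slots = [h(k, seed) % n for k in buckets[b]]
            no_internal_collision = len(set(slots)) == len(slots)
            if no_internal_collision and not any(taken[s] for s in slots):
                break
            seed += 1  # brute-force search for a collision-free seed
        for s in slots:
            taken[s] = True
        seeds[b] = seed
    return seeds


def query(key, seeds, n):
    """Evaluate the MPHF: look up the bucket's seed, then hash with it."""
    bucket = h(key, 0) % len(seeds)
    return h(key, seeds[bucket]) % n


def interleave(seeds_per_partition):
    """Place the seeds of the i-th bucket of every partition next to each other."""
    num_buckets = len(seeds_per_partition[0])
    return [part[i] for i in range(num_buckets) for part in seeds_per_partition]


def seed_at(interleaved, num_partitions, partition, bucket):
    """Index into the interleaved array produced by `interleave`."""
    return interleaved[bucket * num_partitions + partition]


if __name__ == "__main__":
    keys = [f"key{i}" for i in range(200)]
    seeds = build(keys)
    values = sorted(query(k, seeds, len(keys)) for k in keys)
    print(values == list(range(len(keys))))
```

In the interleaved layout, all seeds that share a bucket index sit in one contiguous run, so a compressor could be tuned per run. The sketch only shows the indexing; it assumes every partition has the same number of buckets.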
