ShockHash: Towards Optimal-Space Minimal Perfect Hashing Beyond Brute-Force

A minimal perfect hash function (MPHF) maps a set $S$ of $n$ keys to the first $n$ integers without collisions. There is a lower bound of $n\log_2e-O(\log n)$ bits of space needed to represent an MPHF. A matching upper bound is obtained using the brute-force algorithm that tries random hash functions until stumbling on an MPHF and stores that function's seed. In expectation, $e^n\textrm{poly}(n)$ seeds need to be tested. The most space-efficient previous algorithms for constructing MPHFs all use such a brute-force approach as a basic building block. In this paper, we introduce ShockHash (Small, heavily overloaded cuckoo hash tables). ShockHash uses two hash functions $h_0$ and $h_1$, hoping for the existence of a function $f : S \rightarrow \{0,1\}$ such that $x \mapsto h_{f(x)}(x)$ is an MPHF on $S$. In graph terminology, ShockHash generates $n$-edge random graphs on $n$ nodes until stumbling on a pseudoforest, i.e., a graph where each component contains as many edges as nodes. Using cuckoo hashing, ShockHash then derives an MPHF from the pseudoforest in linear time. It uses a 1-bit retrieval data structure to store $f$ using $n + o(n)$ bits. By carefully analyzing the probability that a random graph is a pseudoforest, we show that ShockHash needs to try only $(e/2)^n\textrm{poly}(n)$ hash function seeds in expectation, reducing the space for storing the seed by roughly $n$ bits. This makes ShockHash almost a factor of $2^n$ faster than brute-force, while maintaining the asymptotically optimal space consumption. An implementation within the RecSplit framework yields the currently most space-efficient MPHFs, i.e., competing approaches need about two orders of magnitude more work to achieve the same space.
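The seed search described above can be illustrated with a small sketch. This is not the paper's implementation: the hash function `h` (built on BLAKE2b), the helper names `is_pseudoforest`, `orient`, `build`, and `lookup` are all hypothetical, a plain dictionary stands in for the 1-bit retrieval data structure, and a simple augmenting-path search stands in for the linear-time cuckoo hashing construction. It is only meant to show how a pseudoforest on $n$ nodes with $n$ edges yields the function $f$.

```python
import hashlib
from itertools import count


def h(key, seed, which, n):
    """Hypothetical stand-in for h_0 / h_1: BLAKE2b of (seed, which, key), mod n."""
    data = f"{seed}:{which}:{key}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little") % n


def is_pseudoforest(edges, n):
    """Union-find check: every component has at most one cycle."""
    parent = list(range(n))
    has_cycle = [False] * n

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            if has_cycle[ra]:
                return False  # second cycle in the same component
            has_cycle[ra] = True
        else:
            if has_cycle[ra] and has_cycle[rb]:
                return False  # merging two cyclic components
            parent[ra] = rb
            has_cycle[rb] = has_cycle[ra] or has_cycle[rb]
    return True


def orient(edges, n):
    """Assign each key (edge) to one of its two cells (nodes).

    Uses augmenting paths (an assumption of this sketch; the paper uses
    cuckoo hashing). Returns owner[cell] = key index, or None on failure.
    """
    owner = [None] * n

    def place(i, visited):
        for cell in edges[i]:
            if cell in visited:
                continue
            visited.add(cell)
            if owner[cell] is None or place(owner[cell], visited):
                owner[cell] = i
                return True
        return False

    for i in range(len(edges)):
        if not place(i, set()):
            return None
    return owner


def build(keys):
    """Try seeds until the graph is a pseudoforest; return (seed, f)."""
    n = len(keys)
    for seed in count():
        edges = [(h(k, seed, 0, n), h(k, seed, 1, n)) for k in keys]
        if not is_pseudoforest(edges, n):
            continue
        owner = orient(edges, n)
        # f(x) = 0 if x sits in cell h_0(x), else 1.
        f = {keys[i]: 0 if edges[i][0] == cell else 1
             for cell, i in enumerate(owner)}
        return seed, f


def lookup(key, seed, f, n):
    """Evaluate x -> h_{f(x)}(x)."""
    return h(key, seed, f[key], n)


if __name__ == "__main__":
    keys = [f"key{i}" for i in range(12)]
    seed, f = build(keys)
    print(seed, sorted(lookup(k, seed, f, len(keys)) for k in keys))
```

In this sketch the stored information is the seed plus one bit per key. That mirrors the space accounting in the text: $\log_2\big((e/2)^n\big) = n\log_2 e - n$ bits for the seed, plus $n + o(n)$ bits for $f$, gives $n\log_2 e + o(n)$ bits overall.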
