Explicit Min-wise Hash Families with Optimal Size

We study explicit constructions of min-wise hash families and their extension to $k$-min-wise hash families. Informally, a min-wise hash family guarantees that for any fixed subset $X\subseteq[N]$, every element of $X$ has an equal chance of receiving the smallest hash value among all elements of $X$; a $k$-min-wise hash family extends this guarantee to every subset of $X$ of size $k$. Min-wise hashing is widely used in many areas of computer science, such as sketching, web page detection, and $\ell_0$ sampling. Classical works by Indyk and by Pătrașcu and Thorup have shown that $\Theta(\log(1/\delta))$-wise independent families give min-wise hash families with multiplicative (relative) error $\delta$, resulting in a construction that uses $\Theta(\log(1/\delta)\log N)$ random bits. Based on a reduction from pseudorandom generators for combinatorial rectangles by Saks, Srinivasan, Zhou and Zuckerman, Gopalan and Yehudayoff improved the number of bits to $O(\log N\log\log N)$ for polynomially small error $\delta$. However, no construction with $O(\log N)$ bits (that is, a family of polynomial size) and sub-constant error was previously known. In this work, we continue and extend the study of constructing ($k$-)min-wise hash families from pseudorandomness for combinatorial rectangles and read-once branching programs. Our main result gives the first explicit min-wise hash families that use an optimal (up to a constant factor) number of random bits and achieve sub-constant (in fact, almost polynomially small) error. Specifically, we give an explicit family of $k$-min-wise hash functions with $O(k\log N)$ bits and $2^{-O(\log N/\log\log N)}$ error. This improves on all previous results that use $O(k\log N)$ bits, for any $k=\log^{O(1)}N$. Our main techniques involve several new ideas that adapt the classical Nisan-Zuckerman pseudorandom generator to fool min-wise hashing with a multiplicative error.
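To make the notion of multiplicative error concrete, the following minimal sketch computes, by exhaustive enumeration, how far a small hash family is from being min-wise on a given set $X$. It assumes the usual formalization, which the abstract states only informally: a family has relative error $\delta$ on $X$ if every $x\in X$ is the unique minimum with probability within a factor $(1\pm\delta)$ of $1/|X|$. The family used here, $h(x) = (ax+b) \bmod p$, is a standard pairwise independent family chosen purely for illustration; it is not the construction of this work, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def linear_family(p):
    """All maps h(x) = (a*x + b) mod p with a, b in [0, p).

    For prime p this is a pairwise independent family on [0, p).
    It is an illustrative stand-in, not the paper's construction.
    """
    return list(product(range(p), repeat=2))


def min_probabilities(p, X):
    """Exact probability, over the family, that each x in X is the
    unique minimizer of h on X."""
    family = linear_family(p)
    wins = {x: 0 for x in X}
    for a, b in family:
        values = {x: (a * x + b) % p for x in X}
        smallest = min(values.values())
        argmins = [x for x, v in values.items() if v == smallest]
        if len(argmins) == 1:  # a tie is credited to nobody
            wins[argmins[0]] += 1
    return {x: Fraction(w, len(family)) for x, w in wins.items()}


def relative_error(p, X):
    """Largest relative deviation from the ideal probability 1/|X|."""
    target = Fraction(1, len(X))
    probs = min_probabilities(p, X)
    return max(abs(pr - target) / target for pr in probs.values())


if __name__ == "__main__":
    print(min_probabilities(5, [0, 1]))
    print(relative_error(5, [0, 1]))
```

On the two-element set $\{0,1\}$ with $p=5$, each element is the unique minimum with probability $2/5$ instead of the ideal $1/2$, a relative error of $1/5$ caused entirely by ties. Exact rational arithmetic is used so that the reported error is not blurred by floating-point rounding.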
