Hashing is a common technique in data processing, with a strong impact on the time and resources spent on computation. Hashing also affects the applicability of theoretical results, which often assume access to (unrealistic) uniform/fully-random hash functions. In this paper, we are concerned with designing hash functions that are practical and come with strong theoretical guarantees on their performance. To this end, we present tornado tabulation hashing, which is simple, fast, and exhibits a certain full, local randomness property that provably makes diverse algorithms perform almost as if (abstract) fully-random hashing were used. Examples include classic linear probing, the widely used HyperLogLog algorithm of Flajolet, Fusy, Gandouet, and Meunier [AOFA'07] for counting distinct elements, and the one-permutation hashing of Li, Owen, and Zhang [NIPS'12] for large-scale machine learning. We also provide a very efficient solution for the classical problem of obtaining fully-random hashing on a fixed (but unknown to the hash function) set of $n$ keys using $O(n)$ space. As a consequence, we get more efficient implementations of the splitting trick of Dietzfelbinger and Rink [ICALP'09] and the succinct-space uniform hashing of Pagh and Pagh [SICOMP'08]. Tornado tabulation hashing is based on a simple method to systematically break dependencies in tabulation-based hashing techniques.
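For readers unfamiliar with the tabulation-based techniques mentioned above, the following is a minimal sketch of plain simple tabulation hashing, the classical baseline, and not the tornado construction itself (whose details are not given in this abstract). The function name `make_simple_tabulation` and the parameters `c`, `char_bits`, `out_bits`, and `seed` are hypothetical choices made for illustration only. The last lines show one well-known dependency of the kind such schemes exhibit: four keys forming a "rectangle" in character space always hash to values whose XOR is zero.

```python
import random


def make_simple_tabulation(c=4, char_bits=8, out_bits=32, seed=0):
    """Build a simple tabulation hash over keys of c * char_bits bits.

    Illustrative sketch only: this is classical simple tabulation,
    not tornado tabulation. All names and defaults are assumptions.
    """
    rng = random.Random(seed)
    # One lookup table per character position, each holding
    # 2**char_bits independent random out_bits-bit values.
    tables = [
        [rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
        for _ in range(c)
    ]
    mask = (1 << char_bits) - 1

    def h(key):
        value = 0
        for i in range(c):
            # Extract the i-th character of the key and XOR in its table entry.
            char = (key >> (i * char_bits)) & mask
            value ^= tables[i][char]
        return value

    return h


h = make_simple_tabulation(seed=1)

# Four keys that differ only in their two lowest characters and form a
# "rectangle": (0,0), (1,0), (0,1), (1,1). Every table entry is used an
# even number of times, so the four hash values XOR to zero.
rectangle = [0x0000, 0x0001, 0x0100, 0x0101]
xor_of_hashes = 0
for k in rectangle:
    xor_of_hashes ^= h(k)
```

Here `xor_of_hashes` is zero for every choice of tables, so the four hash values can never be mutually independent. Structural dependencies of this kind are what a tabulation-based scheme has to break to behave like fully-random hashing on a given key set.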