Hashing is a common technique in data processing, with a strong impact on the time and resources spent on computation. Hashing also affects the applicability of theoretical results, which often assume access to (unrealistic) uniform/fully-random hash functions. In this paper, we are concerned with designing hash functions that are practical and come with strong theoretical guarantees on their performance. To this end, we present tornado tabulation hashing, which is simple, fast, and exhibits a certain full, local randomness property that provably makes diverse algorithms perform almost as if (abstract) fully-random hashing were used. Examples include classic linear probing, the widely used HyperLogLog algorithm of Flajolet, Fusy, Gandouet, and Meunier [AOFA'07] for counting distinct elements, and the one-permutation hashing of Li, Owen, and Zhang [NIPS'12] for large-scale machine learning. We also provide a very efficient solution to the classical problem of obtaining fully-random hashing on a fixed (but unknown to the hash function) set of $n$ keys using $O(n)$ space. As a consequence, we get more efficient implementations of the splitting trick of Dietzfelbinger and Rink [ICALP'09] and the succinct space uniform hashing of Pagh and Pagh [SICOMP'08]. Tornado tabulation hashing is based on a simple method to systematically break dependencies in tabulation-based hashing techniques.
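For readers unfamiliar with tabulation-based hashing, the following is a minimal Python sketch of plain simple tabulation hashing, the kind of scheme whose dependencies the abstract refers to. It is not the tornado construction, which the abstract does not spell out. The class name `SimpleTabulation`, the 4 x 8-bit key layout, and the 32-bit output width are assumptions chosen for illustration.

```python
import random


class SimpleTabulation:
    """Plain simple tabulation hashing (illustrative; NOT tornado tabulation).

    A key is viewed as `num_chars` characters of `char_bits` bits each.
    Each character position has its own table of random values, and the
    hash is the XOR of the table entries selected by the key's characters.
    """

    def __init__(self, num_chars=4, char_bits=8, out_bits=32, seed=None):
        rng = random.Random(seed)
        self.num_chars = num_chars
        self.char_bits = char_bits
        self.mask = (1 << char_bits) - 1
        # One independent table of random out_bits-bit values per position.
        self.tables = [
            [rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
            for _ in range(num_chars)
        ]

    def hash(self, key):
        h = 0
        for i in range(self.num_chars):
            # Extract the i-th character (least significant first).
            char = (key >> (i * self.char_bits)) & self.mask
            h ^= self.tables[i][char]
        return h


if __name__ == "__main__":
    t = SimpleTabulation(seed=1)
    # Four keys forming a "rectangle" in the first two character positions.
    keys = [0x0000, 0x0001, 0x0100, 0x0101]
    x = 0
    for k in keys:
        x ^= t.hash(k)
    print(x)  # the four hash values always XOR to zero
```

The final lines illustrate one well-known dependency of simple tabulation: for four keys that agree everywhere except on two positions, where they take all combinations of two character values, every table entry is selected an even number of times, so the hash values XOR to zero regardless of the random tables. Fully random hashing would not behave this way, and dependencies of this type are what a scheme such as the one described above has to break.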