Conservative Count-Min, an improved version of the Count-Min sketch [Cormode, Muthukrishnan 2005], is an online-maintained, hashing-based data structure that summarizes element frequency information without storing the elements themselves. Although several works have attempted to analyze the error that Count-Min can make, the behavior of this data structure remains poorly understood. In [Fusy, Kucherov 2022], we demonstrated that, under a uniform distribution of input elements, the error of conservative Count-Min follows two distinct regimes depending on its load factor. In this work, we present a series of experimental results that give new insights into the behavior of conservative Count-Min. Our contributions are twofold. On the one hand, we provide a detailed experimental analysis of the behavior of the Count-Min sketch in different regimes and under several representative probability distributions of input elements. On the other hand, we demonstrate improvements that can be obtained by assigning a variable number of hash functions to different elements. These include, in particular, a reduction in the space used by the data structure while still maintaining a small error.
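To make the conservative update rule concrete, the following is a minimal Python sketch, not the implementation used in this work. It assumes a single shared array of counters indexed by several hash functions. The class name `ConservativeCountMin`, the parameters `num_counters` and `num_hashes`, and the use of salted BLAKE2b as the hash family are all illustrative assumptions. On insertion, only the counters that currently hold the minimum value among the element's positions are incremented; a query returns that minimum.

```python
import hashlib


class ConservativeCountMin:
    """Illustrative Count-Min sketch with conservative update.

    Assumption: one shared counter array and num_hashes salted hash
    functions; the layout and hash family are not taken from the paper.
    """

    def __init__(self, num_counters, num_hashes):
        if num_counters <= 0 or num_hashes <= 0:
            raise ValueError("num_counters and num_hashes must be positive")
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _positions(self, item):
        # Derive one counter index per hash function by salting BLAKE2b
        # with the hash function's index (hypothetical hash family).
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=i.to_bytes(8, "little"),
            ).digest()
            positions.append(int.from_bytes(digest, "little") % self.num_counters)
        return positions

    def add(self, item):
        # Conservative update: increment only the counters that equal the
        # current minimum over the item's positions.
        positions = self._positions(item)
        current = min(self.counters[p] for p in positions)
        for p in set(positions):  # set() avoids incrementing a counter twice
            if self.counters[p] == current:
                self.counters[p] += 1

    def estimate(self, item):
        # The estimate is the minimum counter; it never underestimates
        # the true number of insertions of the item.
        return min(self.counters[p] for p in self._positions(item))
```

In this sketch, the load factor would correspond to the number of distinct inserted elements divided by `num_counters`. Every element uses the same number of hash functions here; assigning a variable number of hash functions to different elements is not shown.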