Count-Min Sketch with Conservative Updates (CMS-CU) is a memory-efficient, hash-based data structure for estimating the number of occurrences of items in a data stream. CMS-CU stores $m$ counters and employs $d$ hash functions to map items to these counters. We first argue that the estimation error in CMS-CU is maximal when each item appears at most once in the stream, and then study CMS-CU in this setting. For $d=m-1$, we prove that the average estimation error and the average counter rate converge almost surely to $\frac{1}{2}$, in contrast with the vanilla Count-Min Sketch, where the average counter rate equals $\frac{m-1}{m}$. For any given $m$ and $d$, we prove novel lower and upper bounds on the average estimation error that depend on a positive integer parameter $g$; larger values of $g$ yield tighter bounds. Computing each bound requires analyzing an ergodic Markov process with a state space of size $\binom{m+g-d}{g}$ and a sparse transition probability matrix containing $\mathcal{O}(m\binom{m+g-d}{g})$ non-zero entries. For $d=m-1$ and $g=1$, we show that the lower and upper bounds coincide as $m\to \infty$. In general, numerical computation shows that our bounds are highly accurate for small values of $g$. For example, for $m=50$, $d=4$, and $g=5$, the difference between the lower and upper bounds is smaller than $10^{-4}$.
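To make the update rule concrete, the following is a minimal Python sketch of a conservative-update structure with $m$ counters and $d$ hash functions, run on a stream in which every item appears exactly once. Several choices here are assumptions for illustration and are not taken from the abstract: the class name `ConservativeCountMin`, the helper `average_counter_rate`, the single shared counter array, the use of a seeded pseudo-random generator as a stand-in for $d$ hash functions that select $d$ distinct counters, and the reading of "average counter rate" as the total counter mass divided by $m$ times the number of items.

```python
import random


class ConservativeCountMin:
    """Illustrative sketch: one array of m counters, d counters per item (assumed layout)."""

    def __init__(self, m, d, seed=0):
        if not 1 <= d <= m:
            raise ValueError("need 1 <= d <= m")
        self.m, self.d, self.seed = m, d, seed
        self.counters = [0] * m

    def _cells(self, item):
        # Hypothetical hashing: seed a PRNG from the item and draw d distinct
        # counter indices. A real implementation would use d hash functions.
        rng = random.Random(f"{self.seed}:{item!r}")
        return rng.sample(range(self.m), self.d)

    def query(self, item):
        # The estimate is the minimum over the item's counters.
        return min(self.counters[i] for i in self._cells(item))

    def update(self, item, conservative=True):
        cells = self._cells(item)
        if conservative:
            # Conservative update: raise only the counters below (min + 1).
            target = min(self.counters[i] for i in cells) + 1
            for i in cells:
                if self.counters[i] < target:
                    self.counters[i] = target
        else:
            # Vanilla update: increment every counter the item maps to.
            for i in cells:
                self.counters[i] += 1


def average_counter_rate(m, d, n_items, conservative=True, seed=0):
    """Total counter mass divided by (m * n_items), for a stream of distinct items."""
    sketch = ConservativeCountMin(m, d, seed)
    for item in range(n_items):  # each item appears exactly once
        sketch.update(item, conservative)
    return sum(sketch.counters) / (m * n_items)


if __name__ == "__main__":
    m = 10
    print("vanilla:", average_counter_rate(m, m - 1, 20_000, conservative=False))
    print("conservative:", average_counter_rate(m, m - 1, 20_000))
```

Under this reading, the vanilla update adds exactly $d$ to the total counter mass per item, so its rate is $d/m$, which equals $\frac{m-1}{m}$ when $d=m-1$. The conservative variant adds at least one and at most $d$ per item, so its rate lies between $\frac{1}{m}$ and $\frac{d}{m}$; the simulation can be used to compare it informally against the limit stated above.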