Count-Min Sketch with Conservative Updates (\texttt{CMS-CU}) is a memory-efficient, hash-based data structure for estimating the number of occurrences of items in a data stream. \texttt{CMS-CU} stores~$m$ counters and employs~$d$ hash functions to map items to these counters. We first argue that the estimation error in \texttt{CMS-CU} is maximal when each item appears at most once in the stream. We then study \texttt{CMS-CU} in this setting. Specifically:
\begin{enumerate}
\item For~$d=m-1$, we prove that the average estimation error and the average counter rate converge almost surely to~$\frac{1}{2}$, in contrast with the vanilla Count-Min Sketch, where the average counter rate equals~$\frac{m-1}{m}$.
\item For any given~$m$ and~$d$, we prove novel lower and upper bounds on the average estimation error that depend on a positive integer parameter~$g$. Larger values of~$g$ yield tighter bounds. Computing each bound requires analyzing an ergodic Markov process with a state space of size~$\binom{m+g-d}{g}$ and a sparse transition probability matrix containing~$\mathcal{O}(m\binom{m+g-d}{g})$ non-zero entries.
\item For~$d=m-1$ and~$g=1$, we show that the lower and upper bounds coincide as~$m\to \infty$. More generally, numerical computation shows that our bounds are highly accurate even for small values of~$g$. For example, for~$m=50$, $d=4$, and~$g=5$, the difference between the lower and upper bounds is smaller than~$10^{-4}$.
\end{enumerate}
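
To make the setting concrete, the standard conservative update rule can be sketched as follows. The notation here is an assumption introduced for illustration and is not taken from the text above: $c_j(t)$ denotes the value of counter~$j$ after $t$ items have been processed, and $h_1(x),\dots,h_d(x)$ denote the counters to which item~$x$ is mapped. When item~$x$ arrives at step~$t+1$, only those of its counters that currently attain the minimum are incremented:
% Illustrative sketch; the cases environment assumes amsmath is loaded.
\[
c_j(t+1) =
\begin{cases}
c_j(t)+1, & \text{if } j\in\{h_1(x),\dots,h_d(x)\} \text{ and } c_j(t)=\min_{1\le i\le d} c_{h_i(x)}(t),\\
c_j(t), & \text{otherwise,}
\end{cases}
\]
and the number of occurrences of~$x$ is estimated by $\min_{1\le i\le d} c_{h_i(x)}$. The vanilla Count-Min Sketch instead increments all~$d$ counters of~$x$; assuming the~$d$ hash values of an item are distinct, each arrival then adds~$d$ to the sum of the~$m$ counters, which is consistent with the average counter rate~$\frac{m-1}{m}$ quoted for~$d=m-1$. As a size check on the Markov process in the numerical example, $m=50$, $d=4$, and~$g=5$ give a state space of
\[
\binom{m+g-d}{g}=\binom{51}{5}=\frac{51\cdot 50\cdot 49\cdot 48\cdot 47}{5!}=2{,}349{,}060
\]
states.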