Elastic-Sketch is a hash-based data structure for counting item occurrences in a data stream, and it has been shown empirically to achieve a better memory-accuracy trade-off than classical methods. The algorithm combines a heavy block, which aims to maintain exact counts for a small set of dynamically elected items, with a light block that uses a Count-Min Sketch (CM) to summarize the remaining traffic. The dynamics of the heavy block are governed by a hash function $\beta$ that maps items into $m_1$ buckets and by an eviction threshold $\lambda$ that controls how easily an elected item can be replaced. We show that the performance of Elastic-Sketch depends strongly on the stream characteristics and on the choice of $\lambda$. Since the optimal parameter choices depend on unknown stream properties, we analyze Elastic-Sketch under a stationary random stream model, a common assumption that captures the statistical regularities observed in real workloads. Specifically, as the stream length goes to infinity, we derive closed-form expressions for the limiting distribution of the counters and for the resulting expected counting error. These expressions are efficiently computable, enabling practical grid-based tuning of the memory split between the heavy and CM blocks (via $m_1$) and of the eviction threshold $\lambda$. We further characterize the structure of the optimal eviction threshold, which substantially reduces the search space and shows how this threshold depends on the arrival distribution. Extensive numerical simulations on finite streams drawn from a Zipf distribution validate our asymptotic results.
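To make the roles of $m_1$ and $\lambda$ concrete, the following is a minimal Python sketch of a heavy block combined with a Count-Min light block. It is an illustration under stated assumptions, not the implementation analyzed here. The class and helper names (`ElasticSketch`, `_hash`), the light-block parameters `width` and `depth`, and the use of salted BLAKE2 hashes are all hypothetical choices. The eviction rule (evict once negative votes reach $\lambda$ times the positive votes, then restart the bucket with one vote of each kind) follows the commonly described Elastic-Sketch design and may differ in detail from the variant studied in this work.

```python
import hashlib


def _hash(item, seed, mod):
    """Deterministic salted hash of `item` into range(mod) (illustrative choice)."""
    digest = hashlib.blake2b(
        repr(item).encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % mod


class ElasticSketch:
    """Hypothetical minimal sketch: m1 heavy buckets plus a Count-Min light block."""

    def __init__(self, m1, width, depth, lam):
        self.m1, self.width, self.depth, self.lam = m1, width, depth, lam
        # Each heavy bucket is None or [key, vote_pos, vote_neg, flag].
        # flag=True means part of the key's count may live in the light block.
        self.heavy = [None] * m1
        self.light = [[0] * width for _ in range(depth)]

    def _light_add(self, item, count):
        for row in range(self.depth):
            self.light[row][_hash(item, row + 1, self.width)] += count

    def _light_query(self, item):
        return min(
            self.light[row][_hash(item, row + 1, self.width)]
            for row in range(self.depth)
        )

    def update(self, item):
        idx = _hash(item, 0, self.m1)  # plays the role of the hash function beta
        bucket = self.heavy[idx]
        if bucket is None:
            # Empty bucket: elect the item; its count is exact from now on.
            self.heavy[idx] = [item, 1, 0, False]
        elif bucket[0] == item:
            bucket[1] += 1
        else:
            bucket[2] += 1
            if bucket[2] < self.lam * bucket[1]:
                # Incumbent survives: the arriving item goes to the light block.
                self._light_add(item, 1)
            else:
                # Assumed eviction rule: move the incumbent's count to the light
                # block and elect the arriving item, flagged as possibly partial.
                self._light_add(bucket[0], bucket[1])
                self.heavy[idx] = [item, 1, 1, True]

    def query(self, item):
        bucket = self.heavy[_hash(item, 0, self.m1)]
        if bucket is not None and bucket[0] == item:
            return bucket[1] + (self._light_query(item) if bucket[3] else 0)
        return self._light_query(item)
```

In this sketch a larger `lam` makes elected items harder to replace, and `m1` sets how many items can be tracked in the heavy block at once. Under a fixed memory budget, any memory given to `m1` would come out of the light block's `width` and `depth`, which is the split that the grid-based tuning described above searches over.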